## Introduction

In this notebook we analize the city of **Rome(Italy)** for trying to find the best places for opening a new **Pizza Shop**. <br>As suggested by many resources about this subject, three of the most important things for choosing a new restaurant location are:

1. visibility: a lot of people pass near that place
2. accessibility: the area is easy to access 
3. competitors: it is needed to be not too near another Pizza Shop

So, considering the Rome neighborhoods informations we can retrieve from online sources, it is needed to discover insights for answering the following question:

<i>**where is it possible to find a place in Rome that it is at the same time easy to access, with a lot of visitors and where there is not too much competitions for a Pizza Shop?**</i>

To simplify the analysis we will consider **tube stops** only as places that **realize the points 1 and 2 above** (<u>we won't consider the difference of number of passengers per tube stop</u>). So the question to answer to is:

<i>**where is it possible to find a place in Rome that it is near a tube stop and where there is not too much competitions for a Pizza Shop?**</i>

## Data

The **Foursquare API** will be used to explore neighborhoods in Rome. The **explore** function to get the most common venue categories in each neighborhood, and then it will be used this feature to group the neighborhoods into clusters. 
<p>The <b>k-means clustering algorithm</b> will let to create clusters of neighborhoods and the <b>Folium</b> library to visualize the neighborhoods their clusters.<p>
    
<p>Neighborhoods information will be collected from <b>wikipedia</b>:</p>

* https://it.wikipedia.org/wiki/Quartieri_di_Roma

<p>Tube stops informations will be collected from <b>wikipedia</b>:</p>

* https://en.wikipedia.org/wiki/List_of_Rome_Metro_stations

The above pages scraping with **BeautifulSoup** library, and **geopy** library for converting an address into latitude and longitude, will provide the informations needed for using Foursquare API.
<p>After creating the clusters of neighborhoods we will be able to get the needed insights.

## Methodology

* ### Retrieve informations and create and explore Dataset

In [2]:
#install the components required for web pages scraping
print("INSTALLING Libraries required for Web Scraping...")
!pip install beautifulsoup4
!pip install lxml
!pip install html5lib
!pip install requests
print("INSTALLING Libraries required for Web Scraping. DONE.") 

INSTALLING Libraries required for Web Scraping...
INSTALLING Libraries required for Web Scraping. DONE.


In [3]:
#import the components required for web pages scraping and dataframe creation
from bs4 import BeautifulSoup
import requests
import pandas as pd

In [4]:
#to force memory release
import gc

In [6]:
################################################
#get the Rome Neighborhoods Names and Locations#
################################################

#get the Rome Neighborhoods source web page
source_url = 'https://it.wikipedia.org/wiki/Quartieri_di_Roma'
source = requests.get(source_url).text

#load it in the scraping component
soup = BeautifulSoup(source, 'lxml')

#get the html element containing the needed informations
data_table = soup.find('div', class_='colonne_strette')
#print(data_table.ul.prettify())

#creating empty dataframe
columns = ['Neighborhood', 'Latitude', 'Longitude']
neighborhoods_df = pd.DataFrame(columns = columns)

#get a list of the needed informations from the parsed html element
data_table_rows = data_table.find_all('a')
#print(data_table_rows)

#fill in the dataframe
print("***Filling in Rome Neighborhoods dataframe...***")
for data_table_row in data_table_rows[1:]:
    neighborhood = data_table_row.text
    link_with_coords = "{}{}".format("https://it.wikipedia.org", data_table_row.get('href'))
    #print("{} {}".format(neighborhood, link_with_coords))    
    #print("retrieving {} coords...".format(neighborhood))
    
    #scraping on the linked page to find neighborhood coords
    source_url = link_with_coords
    source = requests.get(source_url).text
    soup = BeautifulSoup(source, 'lxml')
    data_coords = soup.find(id='coordinates').find(class_='geo').text
    print("{}: retrieving coords. DONE.".format(neighborhood))
    
    del soup #this line and the one below have been added for freezing issues
    gc.collect()
    
    dataframe_row = {'Neighborhood': neighborhood.strip(), 'Latitude': data_coords.split(';')[0], 'Longitude': data_coords.split(';')[1]}
    neighborhoods_df = neighborhoods_df.append(dataframe_row, ignore_index = True)
print("***Filling in Rome Neighborhoods dataframe. DONE.***")

neighborhoods_df

***Filling in Rome Neighborhoods dataframe...***
Parioli: retrieving coords. DONE.
Pinciano: retrieving coords. DONE.
Salario: retrieving coords. DONE.
Nomentano: retrieving coords. DONE.
Tiburtino: retrieving coords. DONE.
Prenestino-Labicano: retrieving coords. DONE.
Tuscolano: retrieving coords. DONE.
Appio-Latino: retrieving coords. DONE.
Ostiense: retrieving coords. DONE.
Portuense: retrieving coords. DONE.
Gianicolense: retrieving coords. DONE.
Aurelio: retrieving coords. DONE.
Trionfale: retrieving coords. DONE.
Della Vittoria: retrieving coords. DONE.
Monte Sacro: retrieving coords. DONE.
Trieste: retrieving coords. DONE.
Tor di Quinto: retrieving coords. DONE.
Prenestino-Centocelle: retrieving coords. DONE.
Ardeatino: retrieving coords. DONE.
Pietralata: retrieving coords. DONE.
Collatino: retrieving coords. DONE.
Alessandrino: retrieving coords. DONE.
Don Bosco: retrieving coords. DONE.
Appio Claudio: retrieving coords. DONE.
Appio-Pignatelli: retrieving coords. DONE.
Primava

Unnamed: 0,Neighborhood,Latitude,Longitude
0,Parioli,41.92237,12.49237
1,Pinciano,41.912846,12.489208
2,Salario,41.913929,12.500492
3,Nomentano,41.915061,12.517968
4,Tiburtino,41.901092,12.526932
5,Prenestino-Labicano,41.8852,12.5381
6,Tuscolano,41.869038,12.536473
7,Appio-Latino,41.8734,12.5163
8,Ostiense,41.863651,12.478911
9,Portuense,41.855239,12.452901


In [7]:
neighborhoods_df.dtypes

Neighborhood    object
Latitude        object
Longitude       object
dtype: object

In [8]:
#Converting Latitude and Longitude columns type to a numerical one 
#needed by Folium visualization library
cols = ['Latitude', 'Longitude']
for col in cols:  # Iterate over chosen columns
    neighborhoods_df[col] = pd.to_numeric(neighborhoods_df[col])
    
neighborhoods_df.dtypes

Neighborhood     object
Latitude        float64
Longitude       float64
dtype: object

In [9]:
################################################
#get the Rome Tube Stops Names and Locations#
################################################

#get the Rome Neighborhoods source web page
source_url = 'https://en.wikipedia.org/wiki/List_of_Rome_Metro_stations'
source = requests.get(source_url).text

#load it in the scraping component
soup = BeautifulSoup(source, 'lxml')

#get the html element containing the needed informations
data_table = soup.select('table.wikitable.sortable')

#creating empty dataframe
columns = ['Tube Stop', 'Latitude', 'Longitude']
tube_stops_df = pd.DataFrame(columns = columns)

print("***Filling in Rome Tube Stops dataframe...***")
for line in data_table:
    stops = line.findAll('tr')[1:]
    for stop in stops:
        stop_name_col = stop.findAll('td')[0]
        stop_name = stop_name_col.text
        stop_coords_link = "https://en.wikipedia.org{}".format(stop_name_col.a.get('href'))
        #print("{} https://en.wikipedia.org{}".format(stop_name, stop_coords_link))
        
        #scraping on the linked page to find neighborhood coords
        source_url = stop_coords_link
        source = requests.get(source_url).text
        soup = BeautifulSoup(source, 'lxml')
        data_coords = soup.find(id='coordinates').find(class_='geo').text
        print("{}: retrieving coords. DONE.".format(stop_name.strip()))

        del soup #this line and the one below have been added for freezing issues
        gc.collect()
        
        dataframe_row = {'Tube Stop': stop_name.strip(), 'Latitude': data_coords.split(';')[0], 'Longitude': data_coords.split(';')[1]}
        tube_stops_df = tube_stops_df.append(dataframe_row, ignore_index = True)
print("***Filling in Rome Tube Stops dataframe. DONE.***")

tube_stops_df.head()

***Filling in Rome Tube Stops dataframe...***
Battistini: retrieving coords. DONE.
Cornelia: retrieving coords. DONE.
Baldo degli Ubaldi: retrieving coords. DONE.
Valle Aurelia: retrieving coords. DONE.
Cipro: retrieving coords. DONE.
Ottaviano - San Pietro - Musei Vaticani: retrieving coords. DONE.
Lepanto: retrieving coords. DONE.
Flaminio - Piazza del Popolo: retrieving coords. DONE.
Spagna: retrieving coords. DONE.
Barberini - Fontana di Trevi: retrieving coords. DONE.
Repubblica - Teatro dell'Opera: retrieving coords. DONE.
Termini: retrieving coords. DONE.
Vittorio Emanuele: retrieving coords. DONE.
Manzoni – Museo della Liberazione: retrieving coords. DONE.
San Giovanni: retrieving coords. DONE.
Re di Roma: retrieving coords. DONE.
Ponte Lungo: retrieving coords. DONE.
Furio Camillo: retrieving coords. DONE.
Colli Albani: retrieving coords. DONE.
Arco di Travertino: retrieving coords. DONE.
Porta Furba - Quadraro: retrieving coords. DONE.
Numidio Quadrato: retrieving coords. DON

Unnamed: 0,Tube Stop,Latitude,Longitude
0,Battistini,41.906461,12.414722
1,Cornelia,41.90222,12.42528
2,Baldo degli Ubaldi,41.89889,12.43278
3,Valle Aurelia,41.90306,12.44167
4,Cipro,41.9075,12.4475


In [10]:
tube_stops_df.dtypes

Tube Stop    object
Latitude     object
Longitude    object
dtype: object

In [11]:
#Converting Latitude and Longitude columns type to a numerical one 
#needed by Folium visualization library
cols = ['Latitude', 'Longitude']
for col in cols:  # Iterate over chosen columns
    tube_stops_df[col] = pd.to_numeric(tube_stops_df[col])
    
tube_stops_df.dtypes

Tube Stop     object
Latitude     float64
Longitude    float64
dtype: object

* ### Explore Neighborhoods in Rome

In [12]:
#import components needed to convert 
#an address into latitude and longitude values
print("INSTALLING Libraries required for geocoding...")
!conda install -c conda-forge geopy --yes
print("INSTALLING Libraries required for geocoding. DONE.")
from geopy.geocoders import Nominatim

INSTALLING Libraries required for geocoding...
Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

INSTALLING Libraries required for geocoding. DONE.


In [13]:
#get the coordinates of Rome, IT
address = 'Rome, IT'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinate of Rome are {}, {}.'.format(latitude, longitude))

The geographical coordinate of Rome are 41.8933203, 12.4829321.


In [14]:
#import components needed to create and show maps with markers
print("INSTALLING Libraries required for Maps generation...")
!conda install -c conda-forge folium=0.5.0 --yes
print("INSTALLING Libraries required for Maps generation. DONE.")
import folium # map rendering library

INSTALLING Libraries required for Maps generation...
Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

INSTALLING Libraries required for Maps generation. DONE.


In [15]:
# create map of Rome using latitude and longitude values
map_rome = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers representing all Rome neighborhoods to the map
for lat, lng, neighborhood in zip(neighborhoods_df['Latitude'], neighborhoods_df['Longitude'], neighborhoods_df['Neighborhood']):
    label = '{}'.format(neighborhood)
    poiLabel = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=poiLabel,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_rome)  
    
# add markers representing all Rome tube stos to the map
for lat, lng, tube_stop in zip(tube_stops_df['Latitude'], tube_stops_df['Longitude'], tube_stops_df['Tube Stop']):
    label = '{}'.format(tube_stop)
    poiLabel = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=poiLabel,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_rome)  
    
map_rome

* ### Analyze Each Neighborhood in Rome
  #### put in the cell below your Foursquare ID and Secret if you want to run the following requests to Foursquare API

In [59]:
# @hidden_cell
CLIENT_ID = '<YOUR FOURSQUARE ID>' # your Foursquare ID
CLIENT_SECRET = '<YOUR FOURSQUARE SECRET>' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: <YOUR FOURSQUARE ID>
CLIENT_SECRET:<YOUR FOURSQUARE SECRET>


In [19]:
# get Rome neighborhoods venues informations
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print("Getting VENUES information for Neighborhood {} ... DONE.".format(name))
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&query=food'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            100)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

rome_venues = getNearbyVenues(names=neighborhoods_df['Neighborhood'],
                                   latitudes=neighborhoods_df['Latitude'],
                                   longitudes=neighborhoods_df['Longitude']
                                  )
print(rome_venues.shape)
rome_venues.head(10)

Getting VENUES information for Neighborhood Parioli ... DONE.
Getting VENUES information for Neighborhood Pinciano ... DONE.
Getting VENUES information for Neighborhood Salario ... DONE.
Getting VENUES information for Neighborhood Nomentano ... DONE.
Getting VENUES information for Neighborhood Tiburtino ... DONE.
Getting VENUES information for Neighborhood Prenestino-Labicano ... DONE.
Getting VENUES information for Neighborhood Tuscolano ... DONE.
Getting VENUES information for Neighborhood Appio-Latino ... DONE.
Getting VENUES information for Neighborhood Ostiense ... DONE.
Getting VENUES information for Neighborhood Portuense ... DONE.
Getting VENUES information for Neighborhood Gianicolense ... DONE.
Getting VENUES information for Neighborhood Aurelio ... DONE.
Getting VENUES information for Neighborhood Trionfale ... DONE.
Getting VENUES information for Neighborhood Della Vittoria ... DONE.
Getting VENUES information for Neighborhood Monte Sacro ... DONE.
Getting VENUES informatio

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Parioli,41.92237,12.49237,Al Ceppo,41.92265,12.49289,Roman Restaurant
1,Parioli,41.92237,12.49237,Zero,41.922291,12.492854,Restaurant
2,Parioli,41.92237,12.49237,Le Sicilianedde,41.924175,12.490412,Diner
3,Parioli,41.92237,12.49237,Taverna Rossini,41.921848,12.492326,Italian Restaurant
4,Parioli,41.92237,12.49237,La baguetteria,41.925823,12.48944,Sandwich Place
5,Parioli,41.92237,12.49237,Pescheria Rossini,41.92154,12.492492,Seafood Restaurant
6,Parioli,41.92237,12.49237,Il Mattarello II,41.922344,12.492944,Italian Restaurant
7,Parioli,41.92237,12.49237,Caffè Rossini,41.921924,12.492313,Café
8,Parioli,41.92237,12.49237,Låcki,41.923991,12.493662,Scandinavian Restaurant
9,Parioli,41.92237,12.49237,Gargani,41.924704,12.490008,Deli / Bodega


In [20]:
# how many venues per Neighborhood?
rome_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Alessandrino,4,4,4,4,4,4
Appio-Latino,27,27,27,27,27,27
Appio-Pignatelli,4,4,4,4,4,4
Aurelio,17,17,17,17,17,17
Collatino,4,4,4,4,4,4
Della Vittoria,5,5,5,5,5,5
Don Bosco,26,26,26,26,26,26
Europa,40,40,40,40,40,40
Gianicolense,25,25,25,25,25,25
Lido di Castel Fusano,2,2,2,2,2,2


In [21]:
rome_venues.groupby('Neighborhood').count().shape

(31, 6)

In [22]:
#no infos about Ardeatino neighborhood from Foursquare API
rome_venues[rome_venues['Neighborhood'] == 'Ardeatino']

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category


In [23]:
#no infos about Ardeatino neighborhood from Foursquare API
rome_venues[rome_venues['Neighborhood'] == 'Giuliano-Dalmata']

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category


In [24]:
#no infos about Appio-Claudio neighborhood from Foursquare API
rome_venues[rome_venues['Neighborhood'] == 'Appio-Claudio']

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category


In [25]:
# Rome venues unique categories
print('There are {} uniques categories.'.format(len(rome_venues['Venue Category'].unique())))

There are 49 uniques categories.


#### Now create a dataframe in which are grouped the top common venues per Neighborhood

In [26]:
# one hot encoding
rome_onehot = pd.get_dummies(rome_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
rome_onehot['Neighborhood'] = rome_venues['Neighborhood'] 

neighborhood_col_tmp = rome_onehot['Neighborhood']
rome_onehot.drop(labels=['Neighborhood'], axis=1,inplace = True)
rome_onehot.insert(0, 'Neighborhood', neighborhood_col_tmp)

#group rows by neighborhoods taking the mean of each VENUES CATEGORY occurrence frequency 
rome_grouped = rome_onehot.groupby('Neighborhood').mean().reset_index()
rome_grouped

Unnamed: 0,Neighborhood,African Restaurant,Asian Restaurant,BBQ Joint,Bakery,Bistro,Brazilian Restaurant,Breakfast Spot,Burger Joint,Cafeteria,...,Sandwich Place,Scandinavian Restaurant,Seafood Restaurant,Snack Place,Steakhouse,Sushi Restaurant,Thai Restaurant,Trattoria/Osteria,Turkish Restaurant,Vegetarian / Vegan Restaurant
0,Alessandrino,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0
1,Appio-Latino,0.0,0.0,0.0,0.0,0.0,0.0,0.037037,0.037037,0.0,...,0.037037,0.0,0.037037,0.0,0.0,0.0,0.037037,0.037037,0.0,0.0
2,Appio-Pignatelli,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0
3,Aurelio,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.058824,0.0,0.0
4,Collatino,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Della Vittoria,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.2,...,0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Don Bosco,0.0,0.0,0.0,0.038462,0.0,0.0,0.0,0.0,0.0,...,0.038462,0.0,0.0,0.0,0.0,0.038462,0.0,0.038462,0.0,0.0
7,Europa,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.025,...,0.075,0.0,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0
8,Gianicolense,0.0,0.04,0.0,0.0,0.0,0.0,0.04,0.0,0.0,...,0.08,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Lido di Castel Fusano,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
rome_grouped.shape

(31, 50)

In [28]:
# function that sorts the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]


In [29]:
# import library to handle data in a vectorized manner
import numpy as np

In [30]:
# new dataframe with the top venue per Neighborhood
num_top_venues = 1

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = rome_grouped['Neighborhood']

for ind in np.arange(rome_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(rome_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted

Unnamed: 0,Neighborhood,1st Most Common Venue
0,Alessandrino,Italian Restaurant
1,Appio-Latino,Italian Restaurant
2,Appio-Pignatelli,Italian Restaurant
3,Aurelio,Café
4,Collatino,Pizza Place
5,Della Vittoria,Café
6,Don Bosco,Pizza Place
7,Europa,Café
8,Gianicolense,Pizza Place
9,Lido di Castel Fusano,Restaurant


In [31]:
# one hot encoding
rome_top_venues_onehot = pd.get_dummies(neighborhoods_venues_sorted[['1st Most Common Venue']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
rome_top_venues_onehot['Neighborhood'] = neighborhoods_venues_sorted['Neighborhood'] 

neighborhood_col_tmp = rome_top_venues_onehot['Neighborhood']
rome_top_venues_onehot.drop(labels=['Neighborhood'], axis=1,inplace = True)
rome_top_venues_onehot.insert(0, 'Neighborhood', neighborhood_col_tmp)

#group rows by neighborhoods taking the mean of each VENUES CATEGORY occurrence frequency 
rome_top_venues_grouped = rome_top_venues_onehot.groupby('Neighborhood').mean().reset_index()
rome_top_venues_grouped

Unnamed: 0,Neighborhood,African Restaurant,Asian Restaurant,Café,Chinese Restaurant,Italian Restaurant,Japanese Restaurant,Pizza Place,Restaurant
0,Alessandrino,0,0,0,0,1,0,0,0
1,Appio-Latino,0,0,0,0,1,0,0,0
2,Appio-Pignatelli,0,0,0,0,1,0,0,0
3,Aurelio,0,0,1,0,0,0,0,0
4,Collatino,0,0,0,0,0,0,1,0
5,Della Vittoria,0,0,1,0,0,0,0,0
6,Don Bosco,0,0,0,0,0,0,1,0
7,Europa,0,0,1,0,0,0,0,0
8,Gianicolense,0,0,0,0,0,0,1,0
9,Lido di Castel Fusano,0,0,0,0,0,0,0,1


* ### Cluster Neighborhoods in Rome

#### Run k-means to cluster the neighborhood into 8 clusters.

In [32]:
# import k-means from clustering stage
from sklearn.cluster import KMeans

# set number of clusters
kclusters = 8 #the number of different top common venues for Rome neighborhoods

rome_grouped_clustering = rome_top_venues_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(rome_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_

array([1, 1, 1, 3, 2, 3, 2, 3, 2, 5, 3, 4, 0, 2, 1, 1, 3, 6, 1, 7, 2, 5,
       2, 2, 1, 2, 4, 3, 3, 1, 4], dtype=int32)

In [33]:
# add clustering labels
#neighborhoods_venues_sorted.drop(labels=['Cluster Labels'], axis=1,inplace = True) #added to avoid error when running 2 times this cell
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

rome_merged = neighborhoods_df

# merge rome_grouped with toronto_data to add latitude/longitude for each neighborhood
rome_merged = rome_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

rome_merged

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue
0,Parioli,41.92237,12.49237,3.0,Café
1,Pinciano,41.912846,12.489208,1.0,Italian Restaurant
2,Salario,41.913929,12.500492,1.0,Italian Restaurant
3,Nomentano,41.915061,12.517968,1.0,Italian Restaurant
4,Tiburtino,41.901092,12.526932,4.0,Japanese Restaurant
5,Prenestino-Labicano,41.8852,12.5381,2.0,Pizza Place
6,Tuscolano,41.869038,12.536473,4.0,Japanese Restaurant
7,Appio-Latino,41.8734,12.5163,1.0,Italian Restaurant
8,Ostiense,41.863651,12.478911,1.0,Italian Restaurant
9,Portuense,41.855239,12.452901,2.0,Pizza Place


In [34]:
#we remove the rows about 'Ardeatino' and 'Giuliano-Dalmata' neighborhoods as no informations from Foursquare API
rome_merged_cleaned = rome_merged[rome_merged.Neighborhood!='Ardeatino']
rome_merged_cleaned = rome_merged_cleaned[rome_merged.Neighborhood!='Giuliano-Dalmata']
rome_merged_cleaned = rome_merged_cleaned[rome_merged.Neighborhood!='Appio Claudio']
rome_merged_cleaned

  This is separate from the ipykernel package so we can avoid doing imports until
  after removing the cwd from sys.path.


Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue
0,Parioli,41.92237,12.49237,3.0,Café
1,Pinciano,41.912846,12.489208,1.0,Italian Restaurant
2,Salario,41.913929,12.500492,1.0,Italian Restaurant
3,Nomentano,41.915061,12.517968,1.0,Italian Restaurant
4,Tiburtino,41.901092,12.526932,4.0,Japanese Restaurant
5,Prenestino-Labicano,41.8852,12.5381,2.0,Pizza Place
6,Tuscolano,41.869038,12.536473,4.0,Japanese Restaurant
7,Appio-Latino,41.8734,12.5163,1.0,Italian Restaurant
8,Ostiense,41.863651,12.478911,1.0,Italian Restaurant
9,Portuense,41.855239,12.452901,2.0,Pizza Place


In [35]:
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(rome_merged_cleaned['Latitude'], rome_merged_cleaned['Longitude'], rome_merged_cleaned['Neighborhood'], rome_merged_cleaned['Cluster Labels']):
    label = folium.Popup(str(poi) + ' (Cluster ' + str(int(cluster)) + ')', parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

* ### Examine Clusters in Rome

In [36]:
# Cluster 1
neighborhoods_cluster_1 = rome_merged_cleaned.loc[rome_merged_cleaned['Cluster Labels'] == 0, rome_merged_cleaned.columns[[0] + list(range(4, rome_merged_cleaned.shape[1]))]]
neighborhoods_cluster_1

Unnamed: 0,Neighborhood,1st Most Common Venue
14,Monte Sacro,Chinese Restaurant


In [37]:
# Cluster 2
neighborhoods_cluster_2 = rome_merged_cleaned.loc[rome_merged_cleaned['Cluster Labels'] == 1, rome_merged_cleaned.columns[[0] + list(range(4, rome_merged_cleaned.shape[1]))]]
neighborhoods_cluster_2

Unnamed: 0,Neighborhood,1st Most Common Venue
1,Pinciano,Italian Restaurant
2,Salario,Italian Restaurant
3,Nomentano,Italian Restaurant
7,Appio-Latino,Italian Restaurant
8,Ostiense,Italian Restaurant
12,Trionfale,Italian Restaurant
21,Alessandrino,Italian Restaurant
24,Appio-Pignatelli,Italian Restaurant


In [38]:
# Cluster 3
neighborhoods_cluster_3 = rome_merged_cleaned.loc[rome_merged_cleaned['Cluster Labels'] == 2, rome_merged_cleaned.columns[[0] + list(range(4, rome_merged_cleaned.shape[1]))]]
neighborhoods_cluster_3

Unnamed: 0,Neighborhood,1st Most Common Venue
5,Prenestino-Labicano,Pizza Place
9,Portuense,Pizza Place
10,Gianicolense,Pizza Place
20,Collatino,Pizza Place
22,Don Bosco,Pizza Place
25,Primavalle,Pizza Place
26,Monte Sacro Alto,Pizza Place
28,San Basilio,Pizza Place


In [39]:
# Cluster 4
neighborhoods_cluster_4 = rome_merged_cleaned.loc[rome_merged_cleaned['Cluster Labels'] == 3, rome_merged_cleaned.columns[[0] + list(range(4, rome_merged_cleaned.shape[1]))]]
neighborhoods_cluster_4

Unnamed: 0,Neighborhood,1st Most Common Venue
0,Parioli,Café
11,Aurelio,Café
13,Della Vittoria,Café
15,Trieste,Café
16,Tor di Quinto,Café
30,Europa,Café
32,Lido di Ostia Levante,Café


In [40]:
# Cluster 5
neighborhoods_cluster_5 = rome_merged_cleaned.loc[rome_merged_cleaned['Cluster Labels'] == 4, rome_merged_cleaned.columns[[0] + list(range(4, rome_merged_cleaned.shape[1]))]]
neighborhoods_cluster_5

Unnamed: 0,Neighborhood,1st Most Common Venue
4,Tiburtino,Japanese Restaurant
6,Tuscolano,Japanese Restaurant
31,Lido di Ostia Ponente,Japanese Restaurant


In [41]:
# Cluster 6
neighborhoods_cluster_6 = rome_merged_cleaned.loc[rome_merged_cleaned['Cluster Labels'] == 5, rome_merged_cleaned.columns[[0] + list(range(4, rome_merged_cleaned.shape[1]))]]
neighborhoods_cluster_6

Unnamed: 0,Neighborhood,1st Most Common Venue
17,Prenestino-Centocelle,Restaurant
33,Lido di Castel Fusano,Restaurant


In [42]:
# Cluster 7
neighborhoods_cluster_7 = rome_merged_cleaned.loc[rome_merged_cleaned['Cluster Labels'] == 6, rome_merged_cleaned.columns[[0] + list(range(4, rome_merged_cleaned.shape[1]))]]
neighborhoods_cluster_7

Unnamed: 0,Neighborhood,1st Most Common Venue
19,Pietralata,African Restaurant


In [43]:
# Cluster 8
neighborhoods_cluster_8 = rome_merged_cleaned.loc[rome_merged_cleaned['Cluster Labels'] == 7, rome_merged_cleaned.columns[[0] + list(range(4, rome_merged_cleaned.shape[1]))]]
neighborhoods_cluster_8

Unnamed: 0,Neighborhood,1st Most Common Venue
27,Ponte Mammolo,Asian Restaurant


* ### Now we calculate the number of Tube Stops per Neighborhoods (considering the same distance, 500 m, used as radius with Foursquare API for venues) and merge with the Clusters informations

In [44]:
#define a function to calculate distance between 2 points
#based on latitude and longitude
#from: https://kanoki.org/2019/02/14/how-to-find-distance-between-two-points-based-on-latitude-and-longitude-using-python-and-sql/
from math import radians, cos, sin, asin, sqrt
def haversine(lat1, lon1, lat2, lon2):
  # convert decimal degrees to radians 
  lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
  # haversine formula 
  dlon = lon2 - lon1 
  dlat = lat2 - lat1 
  a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
  c = 2 * asin(sqrt(a)) 
  r = 6371 # Radius of earth in kilometers. Use 3956 for miles
  return c * r

In [45]:
#create a dataframe with the number of tube stops per neighborhood

#creating empty dataframe
columns = ['Neighborhood', 'Latitude', 'Longitude', "Number_Of_Tube_Stops (radius<=500)"]
neighborhoods_tube_stops_df = pd.DataFrame(columns = columns)

new_df = tube_stops_df[['Latitude', 'Longitude']]

for latitude, longitude, neighborhood in zip(rome_merged_cleaned['Latitude'], rome_merged_cleaned['Longitude'], rome_merged_cleaned['Neighborhood']):
    ref_point = (latitude, longitude)
    
    stops_count = 0
    for tube_stop_name, latitude, longitude in zip(tube_stops_df['Tube Stop'], tube_stops_df['Latitude'], tube_stops_df['Longitude']):
        if(haversine(ref_point[0], ref_point[1], latitude, longitude) <= 0.5):
            #print('Tube Stop {} counted for {}'.format(tube_stop_name, neighborhood))
            stops_count += 1
    #print('{} has {} tube stops nearby'.format(neighborhood, stops_count))
    #print('')
    #print('')

    dataframe_row = {'Neighborhood': neighborhood.strip(), 'Latitude': latitude, 'Longitude': longitude, 'Number_Of_Tube_Stops (radius<=500)': stops_count}
    neighborhoods_tube_stops_df = neighborhoods_tube_stops_df.append(dataframe_row, ignore_index = True)
    
neighborhoods_tube_stops_df

Unnamed: 0,Neighborhood,Latitude,Longitude,Number_Of_Tube_Stops (radius<=500)
0,Parioli,41.865688,12.707623,0
1,Pinciano,41.865688,12.707623,0
2,Salario,41.865688,12.707623,0
3,Nomentano,41.865688,12.707623,1
4,Tiburtino,41.865688,12.707623,0
5,Prenestino-Labicano,41.865688,12.707623,1
6,Tuscolano,41.865688,12.707623,1
7,Appio-Latino,41.865688,12.707623,0
8,Ostiense,41.865688,12.707623,1
9,Portuense,41.865688,12.707623,0


In [46]:
#merge tube stops infos with clusters dataframes
neighborhoods_cluster_1_merged = neighborhoods_cluster_1
neighborhoods_cluster_1_merged = neighborhoods_cluster_1_merged.join(neighborhoods_tube_stops_df.set_index('Neighborhood'), on='Neighborhood')
neighborhoods_cluster_1_merged

Unnamed: 0,Neighborhood,1st Most Common Venue,Latitude,Longitude,Number_Of_Tube_Stops (radius<=500)
14,Monte Sacro,Chinese Restaurant,41.865688,12.707623,1


In [47]:
neighborhoods_cluster_2_merged = neighborhoods_cluster_2
neighborhoods_cluster_2_merged = neighborhoods_cluster_2_merged.join(neighborhoods_tube_stops_df.set_index('Neighborhood'), on='Neighborhood')
neighborhoods_cluster_2_merged

Unnamed: 0,Neighborhood,1st Most Common Venue,Latitude,Longitude,Number_Of_Tube_Stops (radius<=500)
1,Pinciano,Italian Restaurant,41.865688,12.707623,0
2,Salario,Italian Restaurant,41.865688,12.707623,0
3,Nomentano,Italian Restaurant,41.865688,12.707623,1
7,Appio-Latino,Italian Restaurant,41.865688,12.707623,0
8,Ostiense,Italian Restaurant,41.865688,12.707623,1
12,Trionfale,Italian Restaurant,41.865688,12.707623,0
21,Alessandrino,Italian Restaurant,41.865688,12.707623,0
24,Appio-Pignatelli,Italian Restaurant,41.865688,12.707623,0


In [48]:
neighborhoods_cluster_3_merged = neighborhoods_cluster_3
neighborhoods_cluster_3_merged = neighborhoods_cluster_3_merged.join(neighborhoods_tube_stops_df.set_index('Neighborhood'), on='Neighborhood')
neighborhoods_cluster_3_merged

Unnamed: 0,Neighborhood,1st Most Common Venue,Latitude,Longitude,Number_Of_Tube_Stops (radius<=500)
5,Prenestino-Labicano,Pizza Place,41.865688,12.707623,1
9,Portuense,Pizza Place,41.865688,12.707623,0
10,Gianicolense,Pizza Place,41.865688,12.707623,0
20,Collatino,Pizza Place,41.865688,12.707623,0
22,Don Bosco,Pizza Place,41.865688,12.707623,1
25,Primavalle,Pizza Place,41.865688,12.707623,0
26,Monte Sacro Alto,Pizza Place,41.865688,12.707623,0
28,San Basilio,Pizza Place,41.865688,12.707623,0


In [49]:
neighborhoods_cluster_4_merged = neighborhoods_cluster_4
neighborhoods_cluster_4_merged = neighborhoods_cluster_4_merged.join(neighborhoods_tube_stops_df.set_index('Neighborhood'), on='Neighborhood')
neighborhoods_cluster_4_merged

Unnamed: 0,Neighborhood,1st Most Common Venue,Latitude,Longitude,Number_Of_Tube_Stops (radius<=500)
0,Parioli,Café,41.865688,12.707623,0
11,Aurelio,Café,41.865688,12.707623,0
13,Della Vittoria,Café,41.865688,12.707623,0
15,Trieste,Café,41.865688,12.707623,1
16,Tor di Quinto,Café,41.865688,12.707623,0
30,Europa,Café,41.865688,12.707623,2
32,Lido di Ostia Levante,Café,41.865688,12.707623,0


In [50]:
neighborhoods_cluster_5_merged = neighborhoods_cluster_5
neighborhoods_cluster_5_merged = neighborhoods_cluster_5_merged.join(neighborhoods_tube_stops_df.set_index('Neighborhood'), on='Neighborhood')
neighborhoods_cluster_5_merged

Unnamed: 0,Neighborhood,1st Most Common Venue,Latitude,Longitude,Number_Of_Tube_Stops (radius<=500)
4,Tiburtino,Japanese Restaurant,41.865688,12.707623,0
6,Tuscolano,Japanese Restaurant,41.865688,12.707623,1
31,Lido di Ostia Ponente,Japanese Restaurant,41.865688,12.707623,0


In [51]:
neighborhoods_cluster_6_merged = neighborhoods_cluster_6
neighborhoods_cluster_6_merged = neighborhoods_cluster_6_merged.join(neighborhoods_tube_stops_df.set_index('Neighborhood'), on='Neighborhood')
neighborhoods_cluster_6_merged

Unnamed: 0,Neighborhood,1st Most Common Venue,Latitude,Longitude,Number_Of_Tube_Stops (radius<=500)
17,Prenestino-Centocelle,Restaurant,41.865688,12.707623,2
33,Lido di Castel Fusano,Restaurant,41.865688,12.707623,0


In [52]:
neighborhoods_cluster_7_merged = neighborhoods_cluster_7
neighborhoods_cluster_7_merged = neighborhoods_cluster_7_merged.join(neighborhoods_tube_stops_df.set_index('Neighborhood'), on='Neighborhood')
neighborhoods_cluster_7_merged

Unnamed: 0,Neighborhood,1st Most Common Venue,Latitude,Longitude,Number_Of_Tube_Stops (radius<=500)
19,Pietralata,African Restaurant,41.865688,12.707623,2


In [53]:
neighborhoods_cluster_8_merged = neighborhoods_cluster_8
neighborhoods_cluster_8_merged = neighborhoods_cluster_8_merged.join(neighborhoods_tube_stops_df.set_index('Neighborhood'), on='Neighborhood')
neighborhoods_cluster_8_merged

Unnamed: 0,Neighborhood,1st Most Common Venue,Latitude,Longitude,Number_Of_Tube_Stops (radius<=500)
27,Ponte Mammolo,Asian Restaurant,41.865688,12.707623,0


## Results

An analysis of Rome neighborhoods food related venues and tube stops has been done **to identify the best places for opening a Pizza Shop**. 
* <p>We considered the list of neighborhoods in Rome (blue circles) and the tube stops (red circle)</p>

In [54]:
map_rome

* the top common venues category (food related) per neighborhood and we created clusters using k-means algorithm

In [55]:
map_clusters

where cluster 3 is composed of Neighborhoods with Pizza Place as the top common(food related) venues category.

* the clusters informations merged with the tube stops ones (***not considering cluster for Pizza Places as top common venues and the neighborhoods without at least one tube stop in the radius of 500 meters***)

In [57]:
neighbordhoods_with_at_least_ne_tube_stop_df = pd.concat([
    neighborhoods_cluster_1_merged[neighborhoods_cluster_1_merged['Number_Of_Tube_Stops (radius<=500)']>0],
    neighborhoods_cluster_2_merged[neighborhoods_cluster_2_merged['Number_Of_Tube_Stops (radius<=500)']>0],
    neighborhoods_cluster_4_merged[neighborhoods_cluster_4_merged['Number_Of_Tube_Stops (radius<=500)']>0],
    neighborhoods_cluster_5_merged[neighborhoods_cluster_5_merged['Number_Of_Tube_Stops (radius<=500)']>0],
    neighborhoods_cluster_6_merged[neighborhoods_cluster_6_merged['Number_Of_Tube_Stops (radius<=500)']>0],
    neighborhoods_cluster_7_merged[neighborhoods_cluster_7_merged['Number_Of_Tube_Stops (radius<=500)']>0]
])
neighbordhoods_with_at_least_ne_tube_stop_df                                                          

Unnamed: 0,Neighborhood,1st Most Common Venue,Latitude,Longitude,Number_Of_Tube_Stops (radius<=500)
14,Monte Sacro,Chinese Restaurant,41.865688,12.707623,1
3,Nomentano,Italian Restaurant,41.865688,12.707623,1
8,Ostiense,Italian Restaurant,41.865688,12.707623,1
15,Trieste,Café,41.865688,12.707623,1
30,Europa,Café,41.865688,12.707623,2
6,Tuscolano,Japanese Restaurant,41.865688,12.707623,1
17,Prenestino-Centocelle,Restaurant,41.865688,12.707623,2
19,Pietralata,African Restaurant,41.865688,12.707623,2


## Discussion Section

With respect to the question<br>
**"where is it possible to find a place in Rome that it is near a tube stop and where there is not too much competitions for a Pizza Shop?"**<br>
it was found that there are some neighborhoods that satisfies these requirements (see the last table in the "Results"  section above).

These results should be interpreted with caution for the following reason:
* it is considered a radius of 500 meters far from Neighborhood position for venues and tube stops but it should be checked how accurate is the position of Neighborhoods and Tube Stops got from the web pages used as source (see "Data" section)
* a radius of 500 meters is very limited and could not represent the venues and tube stops actual density for a neighborhood
* tube stops are considered like all having the some weight (i.e.: the some number of people going in/out the tube stop)
* the parameters for choosing the best position for a Pizza Shop are limited as are considered only tube stops but could be considered also others like nearby touristic attractions, parking lots (so considering also customers using cars), shopping centers, etc...  

The reliability of the results could be improved removing one of more of the limits written above for doing further research.

## Conclusion

In the "Introduction" section of this report there is a hypothesis about the fact that the best place for a Pizza Shop is the one with less competitors and a lot of people passing nearby. 
After some analysis of the data (see "Data" section above) the best choices for a Pizza Shop (see the last table in the "Results" section above) seems to be the following neighborhoods: 
* Europa
* Prenestino-Centocelle
* Pietralata

As stated in the "Discussion" section It has been set strong limits to the hypothesis for simplifying the data analysis. I reccomend to remove these limits for further and more reliable research.