# Battle of Neighbourhoods - Week 1


# Part 1: A description of the problem and a discussion of the background

> ### 1.1 Description of the Problem

The population of Italy has grown considerably over the last decades, and the country is very diverse. A wide variety of fresh food is available, including supplies from Africa.

Although there are many fine restaurants in Italy (Asian, Middle Eastern, Latin and American), it is hard to find a good place to dine in.


> ### 1.2 Discussion of the Background

My client is a successful restaurant chain in Ghana that is looking to expand its operations into Europe, starting in Italy. Its target is the luxury restaurant market.
Since Italy has many regions, my client needs deeper insight from the available data in order to decide where to establish the first restaurant. The company spends a lot on research and provides customers with data insight into the ingredients used at restaurants.


# Part 2 - A description of the data and how it will be used to solve the problem

> ### 2.1 Description of Data


Data will be drawn from public sources: Wikipedia and Foursquare.

Italy consists of many regions, each with its own popular areas; the cities are listed at <https://en.wikipedia.org/wiki/List_of_cities_in_Italy>.

#### A sample of the web scrape of the Wikipedia page

In [123]:
from bs4 import BeautifulSoup


import numpy as np


import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


import json
print('numpy, pandas, ..., imported...')

!pip -q install geopy

print('geopy installed...')

from geopy.geocoders import Nominatim
print('Nominatim imported...')


import requests
print('requests imported...')


from pandas.io.json import json_normalize
print('json_normalize imported...')


import matplotlib.cm as cm
import matplotlib.colors as colors
print('matplotlib imported...')


from sklearn.cluster import KMeans
print('Kmeans imported...')


!pip -q install geocoder
import geocoder


import time


!pip -q install folium
print('folium installed...')
import folium # map rendering library
print('folium imported...')
print('...Done')

numpy, pandas, ..., imported...
geopy installed...
Nominatim imported...
requests imported...
json_normalize imported...
matplotlib imported...
Kmeans imported...
folium installed...
folium imported...
...Done


In [124]:
wikipedia_link = 'https://en.wikipedia.org/wiki/List_of_cities_in_Italy'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:64.0) Gecko/20100101 Firefox/64.0'}
wikipedia_page = requests.get(wikipedia_link, headers = headers)
wikipedia_page

<Response [200]>

In [125]:
soup = BeautifulSoup(wikipedia_page.content, 'html.parser')
table = soup.find('table', {'class':'wikitable sortable'}).tbody
rows = table.find_all('tr')
columns = [i.text.replace('\n', '')
           for i in rows[0].find_all('th')]
df = pd.DataFrame(columns = columns)
df

Unnamed: 0,Rank,City,2011 Census,2017 Estimate,Change,Region


In [126]:
# Collect the cleaned rows first, then build the DataFrame in one step
records = []
for row in rows[1:]:
    tds = row.find_all('td')
    # Chain the replacements so both newlines and non-breaking spaces are removed
    values = [td.text.replace('\n', '').replace('\xa0', '') for td in tds]
    # Keep only rows whose cell count matches the header columns
    if len(values) == len(columns):
        records.append(values)

df = pd.DataFrame(records, columns = columns)
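
A pitfall worth knowing when cleaning scraped cell text is the difference between chained and nested `replace` calls. The sketch below uses a hypothetical cell value (not taken from the Wikipedia table) and a hypothetical helper name, `clean_cell`, to show that nesting the second call inside the first leaves the non-breaking spaces in place.

```python
def clean_cell(text):
    """Strip newlines and non-breaking spaces from scraped cell text."""
    return text.replace('\n', '').replace('\xa0', '')


# Hypothetical raw cell: digits separated by non-breaking spaces, plus a newline
raw = '2\xa0617\xa0175\n'

# Nested form: the inner ''.replace(...) evaluates to '' first,
# so only the newline is removed from the outer string
nested = raw.replace('\n', ''.replace('\xa0', ''))

print(repr(nested))
print(repr(clean_cell(raw)))
```

The chained form applies each replacement to the result of the previous one, which is usually what is intended.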

In [127]:
df.head(5)

Unnamed: 0,Rank,City,2011 Census,2017 Estimate,Change,Region
0,1,Rome,2617175,2872800,+9.77%,Lazio
1,2,Milan,1242123,1366180,+9.99%,Lombardy
2,3,Naples,962003,966144,+0.43%,Campania
3,4,Turin,872367,882523,+1.16%,Piedmont
4,5,Palermo,657651,668405,+1.64%,Sicily


In [128]:
df['City'] = df['City'].map(lambda x: x.rstrip(']').rstrip('0123456789').rstrip('['))
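
The `rstrip` chain above removes trailing footnote markers such as `[1]` from city names, but it would also strip digits that legitimately end a name. As a hedged alternative, the sketch below uses a regular expression that removes only bracketed markers; the helper name `strip_footnotes` and the sample names are illustrative assumptions.

```python
import re

# Matches one or more trailing footnote markers such as "[1]" or "[12][3]"
FOOTNOTE_RE = re.compile(r'(\[\d+\])+$')


def strip_footnotes(name):
    """Remove trailing Wikipedia-style footnote markers from a name."""
    return FOOTNOTE_RE.sub('', name).strip()


print(strip_footnotes('Rome[1]'))
print(strip_footnotes('Reggio Emilia'))
```

In the notebook this could be applied with `df['City'].map(strip_footnotes)`.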

In [129]:
df0 = df.drop('City', axis=1).join(df['City'].str.split(',', expand=True).stack().reset_index(level=1, drop=True).rename('City'))

In [130]:
df1 = df0[['Rank', 'City', 'Region']].reset_index(drop=True)

In [131]:
df1.head(5)

Unnamed: 0,Rank,City,Region
0,1,Rome,Lazio
1,2,Milan,Lombardy
2,3,Naples,Campania
3,4,Turin,Piedmont
4,5,Palermo,Sicily


In [132]:
df2 = df1

In [133]:
df21 = df2[df2['Region'].str.contains('Lazio')]

In [134]:
df21.head(10)

Unnamed: 0,Rank,City,Region
0,1,Rome,Lazio
31,32,Latina,Lazio
57,58,Guidonia Montecelio,Lazio
69,70,Fiumicino,Lazio
75,76,Aprilia,Lazio
87,88,Viterbo,Lazio
94,95,Pomezia,Lazio
114,115,Tivoli,Lazio
120,121,Anzio,Lazio
129,130,Velletri,Lazio


In [135]:
df3 = df21[['Rank', 'City', 'Region']].reset_index(drop=True)

In [136]:
df_Lazio = df3

In [137]:
df21.shape

(11, 3)

### 2.1.2 Dataset 2:

To obtain location data, the Geocoder package is used with the ArcGIS geocoder to look up the latitude and longitude of each location.

In [138]:
'''Geocoder starts here'''
'''Defining a function to use --> get_latlng()'''
def get_latlng(arcgis_geocoder):
    
   
    lat_lng_coords = None
    
   
    
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, Lazio, Italy'.format(arcgis_geocoder))
        lat_lng_coords = g.latlng
    return lat_lng_coords
'''Geocoder ends here'''

'Geocoder ends here'
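
The `while` loop in `get_latlng` retries until the geocoder returns a result, so it never terminates if a query cannot be resolved. A minimal sketch of a bounded retry follows. The function name `get_latlng_with_retries`, its parameters, and the fake geocoder used for the demonstration are all assumptions made for illustration; no real geocoding service is called.

```python
import time


def get_latlng_with_retries(query, geocode, max_attempts=5, delay=0.0):
    """Call geocode(query) until it returns coordinates or attempts run out.

    `geocode` is any callable that returns [lat, lng] or None.
    Returns None if every attempt fails.
    """
    for _ in range(max_attempts):
        coords = geocode(query)
        if coords is not None:
            return coords
        time.sleep(delay)  # pause between attempts
    return None


def make_flaky_geocoder(failures, coords):
    """Build a fake geocoder that fails `failures` times, then succeeds."""
    calls = {'n': 0}

    def geocode(query):
        calls['n'] += 1
        return None if calls['n'] <= failures else coords

    return geocode


print(get_latlng_with_retries('Rome, Lazio, Italy',
                              make_flaky_geocoder(2, [41.9, 12.5])))
```

In the notebook, the `geocode` argument could be `lambda q: geocoder.arcgis(q).latlng`, which mirrors the call used in `get_latlng`.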

In [139]:
# Geocode only the Lazio cities (get_latlng appends ", Lazio, Italy" to each name)
City_codes = df_Lazio['City']
coordinates = [get_latlng(City_code) for City_code in City_codes.tolist()]

In [140]:
df_Lazio_loc = df_Lazio

df_Lazio_coordinates = pd.DataFrame(coordinates, columns = ['Latitude', 'Longitude'])
df_Lazio_loc['Latitude'] = df_Lazio_coordinates['Latitude']
df_Lazio_loc['Longitude'] = df_Lazio_coordinates['Longitude']

## Testing Sample for Rome

In [141]:
sample = get_latlng('Rome')
sample

[41.90322000000003, 12.495650000000069]

## Reverse geocoding this, using the geocodefarm geocoder, gives the following:

In [142]:
gg = geocoder.geocodefarm(sample, method = 'reverse')
gg

<[OK] Geocodefarm - Reverse [Via Vittorio Emanuele Orlando 70, 00185 Rome Rome, Italy]>

## Applying the function to find Latitude and Longitude

In [143]:
start = time.time()

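# Note: this passes the 'Region' column, so every row is geocoded at region level;
# passing df_Lazio['City'] instead would geocode each city separately.
City_codes = df_Lazio['Region']
coordinates = [get_latlng(City_code) for City_code in City_codes.tolist()]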

end = time.time()
print("Time of execution: ", end - start, "seconds")

Time of execution:  5.4745848178863525 seconds


## Then we proceed to store the location data - latitude and longitude as follows. 

In [144]:
df_se_loc = df_Lazio


df_se_coordinates = pd.DataFrame(coordinates, columns = ['Latitude', 'Longitude'])
df_se_loc['Latitude'] = df_se_coordinates['Latitude']
df_se_loc['Longitude'] = df_se_coordinates['Longitude']

In [145]:
df_Lazio_loc.head()

Unnamed: 0,Rank,City,Region,Latitude,Longitude
0,1,Rome,Lazio,41.975682,12.772626
1,32,Latina,Lazio,41.975682,12.772626
2,58,Guidonia Montecelio,Lazio,41.975682,12.772626
3,70,Fiumicino,Lazio,41.975682,12.772626
4,76,Aprilia,Lazio,41.975682,12.772626
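
Every row in the table above has the same coordinates because the geocoding query was built from the `Region` value rather than the city name. The sketch below reproduces only the query strings, using the format from `get_latlng`; the helper name `build_queries` is an assumption and no geocoder is called.

```python
def build_queries(names, region='Lazio', country='Italy'):
    """Build one geocoding query string per name, as get_latlng formats them."""
    return ['{}, {}, {}'.format(name, region, country) for name in names]


cities = ['Rome', 'Latina', 'Fiumicino']
regions = ['Lazio', 'Lazio', 'Lazio']

# Region-based queries collapse to a single distinct string
print(build_queries(regions))

# City-based queries stay distinct, so each city can resolve to its own location
print(build_queries(cities))
```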


In [146]:
df_Lazio_loc.to_csv('SELazioLocationsCoordinates.csv', index = False)

# 3 Methodology

### 3.1 Data Exploration


#### 3.1.1 Single Neighbourhood

In [147]:
La_df = df_Lazio_loc.reset_index().drop('index', axis = 1)

In [148]:
La_df.shape

(11, 5)

In [149]:
La_df

Unnamed: 0,Rank,City,Region,Latitude,Longitude
0,1,Rome,Lazio,41.975682,12.772626
1,32,Latina,Lazio,41.975682,12.772626
2,58,Guidonia Montecelio,Lazio,41.975682,12.772626
3,70,Fiumicino,Lazio,41.975682,12.772626
4,76,Aprilia,Lazio,41.975682,12.772626
5,88,Viterbo,Lazio,41.975682,12.772626
6,95,Pomezia,Lazio,41.975682,12.772626
7,115,Tivoli,Lazio,41.975682,12.772626
8,121,Anzio,Lazio,41.975682,12.772626
9,130,Velletri,Lazio,41.975682,12.772626


In [150]:
La_df.loc[La_df['City'] == 'Rome']

Unnamed: 0,Rank,City,Region,Latitude,Longitude
0,1,Rome,Lazio,41.975682,12.772626


Let's use Rome with the index location 0.

In [151]:
La_df.loc[0, 'City']

'Rome'

In [152]:
Rome_lat = La_df.loc[0, 'Latitude']
Rome_long = La_df.loc[0, 'Longitude']
Rome_Cit = La_df.loc[0, 'City']
Rome_Region = La_df.loc[0, 'Region']

print('The latitude and longitude values of {} with Region {}, are {}, {}.'.format(Rome_Cit,
                                                                                         Rome_Region,
                                                                                         Rome_lat,
                                                                                       Rome_long))

The latitude and longitude values of Rome with Region Lazio, are 41.975681888000054, 12.772626325000033.


# 3. Methodology 

This section is the main component of the report, where the data is gathered and prepared for analysis. The tools described above are used here, and the notebook cells show the execution of each step.

### METHODOLOGY EXECUTION - Mapping Data

In [153]:
import json
print('numpy, pandas, ..., imported...')

from pandas.io.json import json_normalize
print('json_normalize imported...')



numpy, pandas, ..., imported...
json_normalize imported...


In [154]:
address = 'Lazio, Italy'
geolocator = Nominatim(user_agent="ln_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Lazio, Italy are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Lazio, Italy are 41.9808038, 12.7662312.


In [155]:
neighborhood_latitude=41.9808038
neighborhood_longitude=12.7662312

In [156]:
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
from IPython.display import Image 
from IPython.core.display import HTML 
    
# library for transforming a JSON file into a pandas dataframe
from pandas.io.json import json_normalize

!conda install -c conda-forge folium=0.5.0 --yes
import folium # plotting library

print('Folium installed')
print('Libraries imported.')

Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
geopy                     1.19.0                     py_0    conda-forge
Fetching package metadata .............
Solving package specifications: .

# All requested packages already installed.
# packages in environment at /opt/conda/envs/DSX-Python35:
#
folium                    0.5.0                      py_0    conda-forge
Folium installed
Libraries imported.


Next, we explore the top 100 venues within a 2000-metre radius of Lazio.
To do so, we first build the GET request URL and store it in `url`.

In [232]:
CLIENT_ID = 'FL1GDNSXKBRH2XQ4KIZP2ACOZQ0TQRG0LVFZZNJSY4BTA452'
CLIENT_SECRET = 'YPV5PQBM5G4KSKXKY2XGRSIFM4O2V5N1LKALCGIMFDTZYAD0' 
VERSION = '20190416'
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: FL1GDNSXKBRH2XQ4KIZP2ACOZQ0TQRG0LVFZZNJSY4BTA452
CLIENT_SECRET:YPV5PQBM5G4KSKXKY2XGRSIFM4O2V5N1LKALCGIMFDTZYAD0
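
Hard-coding and printing API credentials makes them visible to anyone who reads the notebook. One common alternative is to read them from environment variables and print only a masked form. The sketch below assumes two hypothetical variable names, `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET`, and uses a plain dictionary with placeholder values in place of the real environment.

```python
import os


def load_foursquare_credentials(environ=None):
    """Read credentials from environment variables (names are hypothetical)."""
    if environ is None:
        environ = os.environ
    try:
        return (environ['FOURSQUARE_CLIENT_ID'],
                environ['FOURSQUARE_CLIENT_SECRET'])
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None


def mask(secret, visible=4):
    """Show only the first few characters of a secret."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)


# Placeholder values for demonstration only
demo_env = {'FOURSQUARE_CLIENT_ID': 'example-id',
            'FOURSQUARE_CLIENT_SECRET': 'example-secret'}

client_id, client_secret = load_foursquare_credentials(demo_env)
print('CLIENT_ID:', mask(client_id))
print('CLIENT_SECRET:', mask(client_secret))
```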


In [233]:
LIMIT = 100 
radius = 2000 
url = 'https://api.foursquare.com/v2/venues/search?ll=&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    Rome_lat, 
    Rome_long, 
    radius, 
    LIMIT)


url

'https://api.foursquare.com/v2/venues/search?ll=&client_id=FL1GDNSXKBRH2XQ4KIZP2ACOZQ0TQRG0LVFZZNJSY4BTA452&client_secret=YPV5PQBM5G4KSKXKY2XGRSIFM4O2V5N1LKALCGIMFDTZYAD0&v=20190416&ll=41.975681888000054,12.772626325000033&radius=2000&limit=100'

In [234]:
results = requests.get(url).json()
results

{'meta': {'code': 400,
  'errorDetail': 'll must be of the form XX.XX,YY.YY (received )',
  'errorType': 'param_error',
  'requestId': '5cb617031ed2196d9025feae'},
 'response': {}}
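
The `param_error` above arises because the URL template contains an empty `ll=` parameter ahead of the populated one, so the API receives a blank value. Building the query string from a parameter list makes this kind of duplicate harder to introduce. The following is a minimal sketch using the standard library; the function name `build_foursquare_url` and the placeholder credentials are assumptions, while the endpoint pattern and parameter names follow the URLs used in this notebook.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_foursquare_url(endpoint, client_id, client_secret, version,
                         lat, lng, radius, limit):
    """Assemble a venues URL with each query parameter given exactly once."""
    params = [
        ('client_id', client_id),
        ('client_secret', client_secret),
        ('v', version),
        ('ll', '{},{}'.format(lat, lng)),
        ('radius', radius),
        ('limit', limit),
    ]
    # safe=',' keeps the comma in "lat,lng" unescaped
    query = urlencode(params, safe=',')
    return 'https://api.foursquare.com/v2/venues/{}?{}'.format(endpoint, query)


url = build_foursquare_url('explore', 'YOUR_CLIENT_ID', 'YOUR_CLIENT_SECRET',
                           '20190416', 41.9031408, 12.4957243, 2000, 100)
print(url)
```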

In [235]:
address = 'Via Vittorio Emanuele Orlando 70, 00185 Rome Rome, Italy'

geolocator = Nominatim(user_agent="foursquare_agent")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print(latitude, longitude)

41.9031408 12.4957243


In [236]:
latitude = 41.9031408
longitude = 12.4957243

In [243]:
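LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 2000 # define radius
# Note: Rome_lat and Rome_long are the coordinates stored in La_df; the
# `latitude` and `longitude` geocoded in the previous cells are not used here.
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(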
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    Rome_lat, 
    Rome_long, 
    radius, 
    LIMIT)

# displays URL
url

'https://api.foursquare.com/v2/venues/explore?&client_id=FL1GDNSXKBRH2XQ4KIZP2ACOZQ0TQRG0LVFZZNJSY4BTA452&client_secret=YPV5PQBM5G4KSKXKY2XGRSIFM4O2V5N1LKALCGIMFDTZYAD0&v=20190416&ll=41.975681888000054,12.772626325000033&radius=2000&limit=100'

In [244]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5cb61843dd57974078f16f71'},
 'response': {'groups': [{'items': [{'reasons': {'count': 0,
       'items': [{'reasonName': 'globalInteractionReason',
         'summary': 'This spot is popular',
         'type': 'general'}]},
      'referralId': 'e-0-4f850f3ae4b0ae40e13808f9-0',
      'venue': {'categories': [{'icon': {'prefix': 'https://ss3.4sqi.net/img/categories_v2/shops/food_grocery_',
          'suffix': '.png'},
         'id': '52f2ab2ebcbc57f1066b8b46',
         'name': 'Supermarket',
         'pluralName': 'Supermarkets',
         'primary': True,
         'shortName': 'Supermarket'}],
       'id': '4f850f3ae4b0ae40e13808f9',
       'location': {'address': 'Via Tiburtina',
        'cc': 'IT',
        'city': 'Villanova',
        'country': 'Italia',
        'distance': 1521,
        'formattedAddress': ['Via Tiburtina', 'Villanova Lazio', 'Italia'],
        'labeledLatLngs': [{'label': 'display',
          'lat': 41.96785784723111,
          'l

From the results, the necessary information is obtained from the `items` key. To do this, the `get_category_type` function from the Foursquare lab is used.

In [245]:
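def get_category_type(row):
    # The category list is stored under 'categories' or 'venue.categories',
    # depending on the shape of the row
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    # Return the name of the first category, or None if there is none
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']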
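
To make the behaviour of `get_category_type` concrete, the sketch below restates the function and applies it to three small dictionaries standing in for dataframe rows. The sample rows are illustrative assumptions: one keeps its list under `categories`, one under `venue.categories`, and one has no categories at all.

```python
def get_category_type(row):
    """Return the name of the first category in a row, or None if absent."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']


# Illustrative rows, not actual API output
flat_row = {'categories': [{'name': 'Supermarket'}]}
nested_row = {'venue.categories': [{'name': 'Pizza Place'}, {'name': 'Plaza'}]}
uncategorised_row = {'venue.categories': []}

for row in (flat_row, nested_row, uncategorised_row):
    print(get_category_type(row))
```

Only the first category in each list is used, so a venue with several categories is reduced to a single label.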

The result is then cleaned up from json to a structured pandas dataframe as shown below:

In [246]:
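# Extract the list of venue items from the response
venues = results['response']['groups'][0]['items']

# Flatten the nested JSON into a dataframe
nearby_venues = json_normalize(venues)

# Keep only the columns of interest
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]

# Replace each category list with the name of its first category
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# Shorten the column names by keeping only the last dotted component
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]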

In [247]:
nearby_venues

Unnamed: 0,name,categories,lat,lng
0,Carrefour Market,Supermarket,41.967858,12.757551
1,Pizzeria da nonna Concetta,Pizza Place,41.986271,12.766311
2,Park Hotel Imperatore Adriano,Hotel,41.962601,12.761033
3,piazza villanova,Plaza,41.963789,12.755968


In [249]:
nearby_venues_Rome_unique = nearby_venues['categories'].value_counts().to_frame(name='Count')

In [250]:
nearby_venues_Rome_unique.head(5)

Unnamed: 0,Count
Pizza Place,1
Plaza,1
Hotel,1
Supermarket,1


Interestingly, even though there are restaurants in the Rome area, they do not appear among the top venues returned. Because we are limited by data availability, the analysis is based on the venues the API returns.

In [251]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

4 venues were returned by Foursquare.


#### 3.1.2 Multiple Neighbourhoods

Next, we explore multiple neighbourhoods, here the cities listed in `La_df`.

In [252]:
def getNearbyVenues(names, latitudes, longitudes, radius=2000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighbourhood', 
                  'Neighbourhood Latitude', 
                  'Neighbourhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
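
Inside `getNearbyVenues`, the expression `v['venue']['categories'][0]['name']` raises an `IndexError` for any venue that has an empty category list. The sketch below shows one way the row-building step could tolerate that case. The helper name `venue_rows` is an assumption, the first sample item reuses values shown earlier in this notebook, and the second item is a hypothetical venue without categories.

```python
def venue_rows(name, lat, lng, items):
    """Turn Foursquare-style items into tuples, allowing missing categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category))
    return rows


items = [
    {'venue': {'name': 'Carrefour Market',
               'location': {'lat': 41.967858, 'lng': 12.757551},
               'categories': [{'name': 'Supermarket'}]}},
    # Hypothetical venue with no category information
    {'venue': {'name': 'Unnamed venue',
               'location': {'lat': 41.97, 'lng': 12.76},
               'categories': []}},
]

for row in venue_rows('Rome', 41.975682, 12.772626, items):
    print(row)
```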

In [254]:
La_venues = getNearbyVenues(names=La_df['City'],
                                   latitudes=La_df['Latitude'],
                                   longitudes=La_df['Longitude']
                                  )

Rome
Latina
Guidonia Montecelio
Fiumicino
Aprilia
Viterbo
Pomezia
Tivoli
Anzio
Velletri
Civitavecchia


In [255]:
La_venues.shape

(44, 7)

In [256]:
len(La_venues)

44

In [257]:
La_venues['Neighbourhood'].value_counts()
La_venues.to_csv('La_venues.csv')

In [258]:
La_venues.head(5)

Unnamed: 0,Neighbourhood,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Rome,41.975682,12.772626,Carrefour Market,41.967858,12.757551,Supermarket
1,Rome,41.975682,12.772626,Pizzeria da nonna Concetta,41.986271,12.766311,Pizza Place
2,Rome,41.975682,12.772626,Park Hotel Imperatore Adriano,41.962601,12.761033,Hotel
3,Rome,41.975682,12.772626,piazza villanova,41.963789,12.755968,Plaza
4,Latina,41.975682,12.772626,Carrefour Market,41.967858,12.757551,Supermarket


In [259]:
La_venues.groupby('Neighbourhood').count()

Unnamed: 0_level_0,Neighbourhood Latitude,Neighbourhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighbourhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Anzio,4,4,4,4,4,4
Aprilia,4,4,4,4,4,4
Civitavecchia,4,4,4,4,4,4
Fiumicino,4,4,4,4,4,4
Guidonia Montecelio,4,4,4,4,4,4
Latina,4,4,4,4,4,4
Pomezia,4,4,4,4,4,4
Rome,4,4,4,4,4,4
Tivoli,4,4,4,4,4,4
Velletri,4,4,4,4,4,4


The next step is to check how many unique venue categories were returned:

In [261]:
print('There are {} unique categories.'.format(len(La_venues['Venue Category'].unique())))

There are 4 unique categories.


In [262]:
La_venue_unique_count = La_venues['Venue Category'].value_counts().to_frame(name='Count')

In [263]:
La_venue_unique_count.head(5)

Unnamed: 0,Count
Pizza Place,11
Plaza,11
Hotel,11
Supermarket,11


In [264]:
La_venue_unique_count.describe()

Unnamed: 0,Count
count,4.0
mean,11.0
std,0.0
min,11.0
25%,11.0
50%,11.0
75%,11.0
max,11.0
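
The summary above can be reproduced by hand with the standard library, which helps in reading it. The sketch assumes a category list in which each of the four categories appears 11 times, matching the counts shown; `Counter` plays the role of `value_counts`, and `statistics.stdev` computes a sample standard deviation, which is what `describe` reports by default.

```python
from collections import Counter
import statistics

# Four categories, each appearing once per neighbourhood across 11 neighbourhoods
categories = ['Supermarket', 'Pizza Place', 'Hotel', 'Plaza'] * 11

counts = Counter(categories)
values = list(counts.values())

print(counts)
print('mean:', statistics.mean(values))
print('std:', statistics.stdev(values))
```

A standard deviation of zero means every category occurs equally often, which is consistent with each neighbourhood returning the same four venues.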


### 3.2 Clustering 

The neighbourhoods are clustered based on the processed data obtained above.
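
The feature-preparation step for clustering is not shown in this notebook. One common approach, assumed here purely for illustration, is to describe each neighbourhood by the relative frequency of each venue category and pass those vectors to a clustering algorithm such as the `KMeans` class imported earlier. The sketch below covers only the frequency computation, in plain Python, with a hypothetical helper name and a small sample of records.

```python
from collections import Counter, defaultdict


def category_frequencies(records):
    """Map each neighbourhood to its vector of category frequencies.

    `records` is an iterable of (neighbourhood, category) pairs.
    Returns the sorted category names and a dict of frequency vectors.
    """
    records = list(records)
    categories = sorted({category for _, category in records})

    per_neighbourhood = defaultdict(Counter)
    for neighbourhood, category in records:
        per_neighbourhood[neighbourhood][category] += 1

    frequencies = {}
    for neighbourhood, counter in per_neighbourhood.items():
        total = sum(counter.values())
        frequencies[neighbourhood] = [counter[c] / total for c in categories]
    return categories, frequencies


# Sample records in the shape of (Neighbourhood, Venue Category)
records = [(city, category)
           for city in ('Rome', 'Latina')
           for category in ('Supermarket', 'Pizza Place', 'Hotel', 'Plaza')]

categories, frequencies = category_frequencies(records)
print(categories)
print(frequencies)
```

When all neighbourhoods share the same frequency vector, as in this sample, a clustering algorithm has nothing to separate them by.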

#### 3.2.1 Libraries

All the necessary libraries have been called in the libraries section above.

#### 3.2.2 Map Visualization

In [4]:
# Import Nominatim in this cell so that it is defined before use
from geopy.geocoders import Nominatim

address = 'Lazio, Italy'

geolocator = Nominatim(user_agent="ln_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Lazio are {}, {}.'.format(latitude, longitude))