# Capstone Project - The Battle of the Neighborhoods (Week 2)
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find an optimal location for a Bar. Specifically, this report will be targeted to stakeholders interested in opening a **Beer Bar** in **Santiago**, Chile.

Since there are lots of Bars in some sectors of Santiago we will try to detect **locations that are not already crowded with Bars**. We are also particularly interested in **areas with no Beer Bars in the vicinity**. We would also prefer locations **as close to Plaza Ñuñoa as possible**, assuming that the first two conditions are met. This city borough is very well known for its nightlife and other food and drink venues.

We will use our data science powers to generate a few of the most promising neighborhoods based on these criteria. The advantages of each area will then be clearly expressed so that the best possible final location can be chosen by stakeholders.

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* number of existing Bars in the neighborhood (any type of Bar)
* number of and distance to Beer Bars in the neighborhood, if any
* distance of neighborhood from location of interest

We decided to use a regularly spaced grid of locations, centered around Plaza Ñuñoa, to define our neighborhoods.

The following data sources will be needed to extract/generate the required information:
* centers of candidate areas will be generated algorithmically and approximate addresses of centers of those areas will be obtained using **Google Maps API reverse geocoding**
* number of Bars and their type and location in every neighborhood will be obtained using **Foursquare API**
* coordinates of the location of interest in Santiago (around Plaza Ñuñoa) will be obtained using **Google Maps API geocoding**

### Neighborhood Candidates

Let's create latitude & longitude coordinates for centroids of our candidate neighborhoods. We will create a grid of cells covering our area of interest, which is approx. 12x12 kilometers centered around Plaza Ñuñoa.

Let's first find the latitude & longitude of Plaza Ñuñoa, using a specific, well-known address and the Google Maps geocoding API.

In [1]:
# The code was removed by Watson Studio for sharing.

In [2]:
import requests

def get_coordinates(api_key, address, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&address={}'.format(api_key, address)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        geographical_data = results[0]['geometry']['location'] # get geographical coordinates
        lat = geographical_data['lat']
        lon = geographical_data['lng']
        return [lat, lon]
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        return [None, None]
    
address = 'Plaza Ñuñoa, Santiago, Chile'
p_nunoa = get_coordinates(google_api_key, address)
print('Coordinate of {}: {}'.format(address, p_nunoa))

Coordinate of Plaza Ñuñoa, Santiago, Chile: [-33.456053, -70.5937558]
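
The address above contains spaces, commas and non-ASCII letters, so the query string has to be percent-encoded before it is sent. Here is a minimal sketch of building the same geocoding URL with explicit encoding, using only the standard library. The helper name `build_geocode_url` and the key `'YOUR_API_KEY'` are placeholders introduced for illustration; they are not part of the original notebook.

```python
from urllib.parse import urlencode

# Endpoint taken from the get_coordinates function above
GEOCODE_ENDPOINT = 'https://maps.googleapis.com/maps/api/geocode/json'

def build_geocode_url(api_key, address):
    """Return a geocoding request URL with percent-encoded query parameters."""
    query = urlencode({'key': api_key, 'address': address})
    return '{}?{}'.format(GEOCODE_ENDPOINT, query)

# 'YOUR_API_KEY' is a placeholder, not a real credential
url = build_geocode_url('YOUR_API_KEY', 'Plaza Ñuñoa, Santiago, Chile')
print(url)
```

A similar effect can be had by passing the same dictionary to `requests.get` through its `params` argument, which encodes the values before sending the request.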


Now let's create a grid of area candidates, equally spaced, centered around Plaza Ñuñoa and within ~6km of it. Our neighborhoods will be defined as circular areas with a radius of 300 meters, so our neighborhood centers will be 600 meters apart.

To calculate distances accurately we need to create our grid of locations in a 2D Cartesian coordinate system, which allows us to calculate distances in meters (not in latitude/longitude degrees). Then we'll project those coordinates back to latitude/longitude degrees to show them on a Folium map. So let's create functions to convert between the WGS84 geographic coordinate system (latitude/longitude degrees) and the UTM Cartesian coordinate system (X/Y coordinates in meters). Note that the projection below does not set the southern-hemisphere flag, so the Y values come out negative; this is only a constant offset and does not affect distances.

In [3]:
!pip install shapely
import shapely.geometry

!pip install pyproj
import pyproj

import math

def lonlat_to_xy(lon, lat):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj='utm', zone=19, datum='WGS84') # utm zone of Santiago is 19
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj='utm', zone=19, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
print('Plaza Ñuñoa longitude={}, latitude={}'.format(p_nunoa[1], p_nunoa[0]))

x, y = lonlat_to_xy(p_nunoa[1], p_nunoa[0])
print('Plaza Ñuñoa UTM X={}, Y={}'.format(x, y))

lo, la = xy_to_lonlat(x, y)
print('Plaza Ñuñoa longitude={}, latitude={}'.format(lo, la))

Coordinate transformation check
-------------------------------
Plaza Ñuñoa longitude=-70.5937558, latitude=-33.456053
Plaza Ñuñoa UTM X=351881.08303354325, Y=-3702982.7586080655
Plaza Ñuñoa longitude=-70.5937558, latitude=-33.456053




Let's create a **hexagonal grid of cells**: we offset every other row, and adjust vertical row spacing so that **every cell center is equally distant from all its neighbors**.

In [4]:
p_nunoa_x, p_nunoa_y = lonlat_to_xy(p_nunoa[1], p_nunoa[0]) # center location in Cartesian coordinates

k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_min = p_nunoa_x - 6000
x_step = 600
y_min = p_nunoa_y - 6000 - (int(21/k)*k*600 - 12000)/2
y_step = 600 * k 

latitudes = []
longitudes = []
distances_from_center = []
xs = []
ys = []

for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 300 if i%2==0 else 0
    
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(p_nunoa_x, p_nunoa_y, x, y)
        
        if (distance_from_center <= 6001):
            lon, lat = xy_to_lonlat(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

print(len(latitudes), 'candidate neighborhood centers generated.')



364 candidate neighborhood centers generated.
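
The row offset and the `sqrt(3)/2` vertical spacing used above are what make all six neighbors of a cell equally distant. Here is a small self-contained sketch that checks this on a toy 5x5 grid. The function name `hex_grid` and the grid size are assumptions made for illustration; the spacing of 600 matches the notebook.

```python
import math

def hex_grid(n_rows, n_cols, spacing):
    """Return (x, y) centers of a hexagonal grid with the given center spacing."""
    k = math.sqrt(3) / 2  # row height as a fraction of the horizontal spacing
    points = []
    for i in range(n_rows):
        x_offset = spacing / 2 if i % 2 == 0 else 0  # shift every other row
        for j in range(n_cols):
            points.append((j * spacing + x_offset, i * spacing * k))
    return points

points = hex_grid(n_rows=5, n_cols=5, spacing=600)

# Distances from an interior point (row 2, column 2) to all other points
cx, cy = points[12]
dists = sorted(math.hypot(x - cx, y - cy)
               for (x, y) in points if (x, y) != (cx, cy))

# The six nearest neighbors should all sit at the grid spacing
print([round(d, 6) for d in dists[:6]])
```

The seventh-nearest point is noticeably farther away, which is why a 300 m radius around each center gives touching, non-overlapping circles.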




Let's visualize the data we have so far: the location of Plaza Ñuñoa and the candidate neighborhood centers:

In [5]:
!pip install folium

import folium



In [6]:
map_santiago = folium.Map(location=p_nunoa, zoom_start=13)
folium.Marker(p_nunoa, popup='Plaza Ñuñoa').add_to(map_santiago)

for lat, lon in zip(latitudes, longitudes):
    folium.Circle([lat, lon], radius=300, color='blue', fill=False).add_to(map_santiago)

map_santiago

OK, we now have the coordinates of the centers of neighborhoods/areas to be evaluated, equally spaced (the distance from every point to its neighbors is exactly the same) and within ~6km of Plaza Ñuñoa.

Let's now use Google Maps API to get approximate addresses of those locations.

In [7]:
def get_address(api_key, latitude, longitude, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&latlng={},{}'.format(api_key, latitude, longitude)
        response = requests.get(url).json()
        
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        address = results[0]['formatted_address']
        
        return address
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        return None

addr = get_address(google_api_key, p_nunoa[0], p_nunoa[1])
print('Reverse geocoding check')
print('-----------------------')
print('Address of [{}, {}] is: {}'.format(p_nunoa[0], p_nunoa[1], addr))

Reverse geocoding check
-----------------------
Address of [-33.456053, -70.5937558] is: Humberto Trucco 45, Ñuñoa, Región Metropolitana, Chile


In [8]:
print('Obtaining location addresses: ', end='')
addresses = []

for lat, lon in zip(latitudes, longitudes):
    address = get_address(google_api_key, lat, lon)
    
    if address is None:
        address = 'NO ADDRESS'
        
    address = address.replace(', Chile', '') # We don't need country part of address
    address = address.replace(', Región Metropolitana', '') # We don't need the region part of address
    addresses.append(address)
    print(' .', end='')
    
print(' done.')

Obtaining location addresses:  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . done.


In [9]:
addresses[150:170]

['Calle Ocho 8073, Penalolen, Peñalolén',
 'Volcán Villarrica 792, Penalolen, Peñalolén',
 'Quebrada de Vitor 605, Penalolen, Peñalolén',
 'PA436-Avenida Matta / Esq. Nataniel Cox, Santiago',
 'San Francisco 1165, Santiago',
 'Carmen 1259, Santiago',
 'Av. Portugal 1341, Santiago',
 'San Eugenio 890, Ñuñoa',
 'Obispo Orrego 652, Ñuñoa',
 'Pedro Aguirre Cerda 785, Ñuñoa',
 'Villoslava 760, Ñuñoa',
 'Exequiel Fernandez 653, Ñuñoa',
 'Dr Johow 600, Ñuñoa',
 'Juan Moya Morales 595, Ñuñoa',
 'Ramón Cruz Montt 633, Ñuñoa',
 'Vasco de Gama 5527, Ñuñoa',
 'Vasco de Gama 773, Penalolen, Peñalolén',
 'Fidias 671, La Reina',
 'Leonidas Banderas 7130, La Reina',
 'Jorge Alessandri 1025, La Reina']

Looking good. Let's now place all this into a Pandas dataframe.

In [10]:
import pandas as pd

df_locations = pd.DataFrame({'Address': addresses,
                             'Latitude': latitudes,
                             'Longitude': longitudes,
                             'X': xs,
                             'Y': ys,
                             'Distance from center': distances_from_center})

df_locations.head(10)

| | Address | Latitude | Longitude | X | Y | Distance from center |
|---|---|---|---|---|---|---|
| 0 | Vicuña Mackenna 5537, San Joaquín, Macul | -33.507336 | -70.614073 | 350081.083034 | -3708699.0 | 5992.495307 |
| 1 | Unnamed Road, Macul | -33.50742 | -70.607615 | 350681.083034 | -3708699.0 | 5840.3767 |
| 2 | Froilán Roa 5900, Macul | -33.507504 | -70.601158 | 351281.083034 | -3708699.0 | 5747.173218 |
| 3 | Av. Departamental 4400, Macul | -33.507587 | -70.5947 | 351881.083034 | -3708699.0 | 5715.767665 |
| 4 | Villa Gildemeister 4372, Penalolen, Peñalolén | -33.50767 | -70.588242 | 352481.083034 | -3708699.0 | 5747.173218 |
| 5 | Chillán 6092, Penalolen, Peñalolén | -33.507753 | -70.581785 | 353081.083034 | -3708699.0 | 5840.3767 |
| 6 | Los Cerezos 6046, Penalolen, Peñalolén | -33.507835 | -70.575327 | 353681.083034 | -3708699.0 | 5992.495307 |
| 7 | Piramide 237, San Joaquín | -33.502525 | -70.623672 | 349181.083034 | -3708179.0 | 5855.766389 |
| 8 | Emco 5080, San Joaquín | -33.502609 | -70.617214 | 349781.083034 | -3708179.0 | 5604.462508 |
| 9 | José Caroca 2087, Macul | -33.502694 | -70.610757 | 350381.083034 | -3708179.0 | 5408.326913 |
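
As a rough cross-check of the 600 m grid spacing, we can compute the great-circle distance between the first two rows of the table directly from their latitude/longitude values. This is a sketch using the haversine formula on a spherical Earth, which is a different technique from the UTM projection used in this notebook, so a difference of a fraction of a percent is expected. The function name `haversine_m` and the radius constant are assumptions made for this example.

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; a spherical approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Rows 0 and 1 of the candidate table (coordinates rounded as displayed)
d = haversine_m(-33.507336, -70.614073, -33.50742, -70.607615)
print('Approximate spacing: {:.0f} m'.format(d))
```

The result should land close to 600 m, which suggests the round trip through the projected coordinates preserved the intended spacing.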


...and let's now save/persist this data to a local file.

In [11]:
df_locations.to_pickle('./locations.pkl')    

### Foursquare

Now that we have our location candidates, let's use the Foursquare API to get info on Bars in each neighborhood.

We're interested in venues in the 'Nightlife Spot' category, but only those that are proper Bars - restaurants, wine bars, lounges etc. are not direct competitors so we don't care about those.

So we will include in our list only venues that have 'Bar' (and related denominations) in the category name, and we'll make sure to detect and include all the subcategories of the specific 'Beer Bar' category, as we need info on Beer Bars in the neighborhood.

Foursquare credentials are defined in the hidden cell below.

In [12]:
# The code was removed by Watson Studio for sharing.

Your credentials:
CLIENT_ID: DUBP4QPTEZHLUFIXQINXG5SPF4UPCN3K2BKVNMLEJVVOUGS2
CLIENT_SECRET:ZJ1KQ23BFUAS5WYODKGUTDS215UPRTWCLXAYRKCOZEDEZ0RM


In [13]:
# Category IDs corresponding to Beer Bars were taken from Foursquare web site 
# (https://developer.foursquare.com/docs/resources/categories):

nightlife_category = '4d4b7105d754a06376d81259' # 'Root' category for all Nightlife Spot-related venues

beer_bar_categories = ['56aa371ce4b08b9a8d57356c', '4bf58dd8d48988d117941735', '50327c8591d4c4b30a586d5d']

def is_bar(categories, specific_filter=None):
    bar_words = ['bar', 'brewery', 'gastrobar', 'restobar', 'fuente', 'cantina']
    bar = False
    specific = False
    
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        
        for r in bar_words:
            if r in category_name:
                bar = True
                
        if 'restaurant' in category_name:
            bar = False
            
        if not(specific_filter is None) and (category_id in specific_filter):
            specific = True
            bar = True
            
    return bar, specific

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', Chile', '')
    address = address.replace(', Región Metropolitana', '')
    
    return address

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=100):
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]        
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        venues = []

    return venues
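
To make the filtering rules in `is_bar` easier to follow, here is a condensed, self-contained sketch of the same decision logic applied to a few made-up venues. The function name `classify` and the category names and `hypothetical-id-*` values are illustrative assumptions; only the keyword list and the three Beer Bar category IDs are copied from the cell above.

```python
BAR_WORDS = ['bar', 'brewery', 'gastrobar', 'restobar', 'fuente', 'cantina']
BEER_BAR_IDS = {'56aa371ce4b08b9a8d57356c',
                '4bf58dd8d48988d117941735',
                '50327c8591d4c4b30a586d5d'}

def classify(categories, beer_ids=BEER_BAR_IDS):
    """Return (is_bar, is_beer_bar) for a list of (name, id) category tuples."""
    bar = False
    beer = False
    for name, cat_id in categories:
        lowered = name.lower()
        if any(word in lowered for word in BAR_WORDS):
            bar = True
        if 'restaurant' in lowered:
            bar = False  # restaurants are not treated as direct competitors
        if cat_id in beer_ids:
            bar = True   # a listed Beer Bar category ID always counts
            beer = True
    return bar, beer

# Made-up venues for illustration only
print(classify([('Cocktail Bar', 'hypothetical-id-1')]))
print(classify([('Sushi Restaurant', 'hypothetical-id-2')]))
print(classify([('Nightclub', 'hypothetical-id-3')]))
print(classify([('Some beer category', '56aa371ce4b08b9a8d57356c')]))
```

Because the name test is a plain substring match, any category whose name contains one of the keywords is counted as a Bar, so the keyword list is worth reviewing when the category set changes.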

In [14]:
# Let's now go over our neighborhood locations and get nearby Bars; 
# we'll also maintain a dictionary of all found Bars and all found Beer Bars

import pickle

def get_bars(lats, lons):
    bars = {}
    beer_bars = {}
    location_bars = []

    print('Obtaining venues around candidate locations:', end='')
    
    for lat, lon in zip(lats, lons):
        # Using radius=350 to make sure we have overlaps/full coverage so we don't miss any bar 
        # (we're using dictionaries to remove any duplicates resulting from area overlaps)
        venues = get_venues_near_location(lat, lon, nightlife_category, foursquare_client_id, foursquare_client_secret, radius=350, limit=100)
        area_bars = []
        
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            
            is_bars, is_beer = is_bar(venue_categories, specific_filter=beer_bar_categories)
            
            if is_bars:
                x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
                bar = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_beer, x, y)
                
                if (venue_distance<=300):
                    area_bars.append(bar)
                bars[venue_id] = bar
                if is_beer:
                    beer_bars[venue_id] = bar
                    
        location_bars.append(area_bars)
        print(' .', end='')
        
    print(' done.')
    
    return bars, beer_bars, location_bars

# Placeholders for the collected data. No cached copy is loaded in this cell,
# so `loaded` stays False and the data is always fetched from the API.
bars = {}
beer_bars = {}
location_bars = []
loaded = False


# Since nothing was loaded, use the Foursquare API to get the data
if not loaded:
    bars, beer_bars, location_bars = get_bars(latitudes, longitudes)
    
    # Let's persist this in the local file system
    with open('bars_350.pkl', 'wb') as f:
        pickle.dump(bars, f)
    with open('beer_bars_350.pkl', 'wb') as f:
        pickle.dump(beer_bars, f)
    with open('location_bars_350.pkl', 'wb') as f:
        pickle.dump(location_bars, f)
        

Obtaining venues around candidate locations: . . . [one progress dot per candidate location] . . . done.
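
The cell above declares a `loaded` flag and writes the results to pickle files, but it never reads them back. Here is a minimal sketch of how a load-or-fetch step could look, so that the API is only queried when no cached file exists. The helper `load_or_compute`, the stand-in `fake_fetch` and the file name `bars_demo.pkl` are all hypothetical and are not part of the original notebook.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return (object, was_cached): load the pickle at `path` or compute and save it."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f), True
    result = compute()
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result, False

# Stand-in for the expensive API call; records how often it runs
calls = []
def fake_fetch():
    calls.append(1)
    return {'venue-id': ('Example Bar', -33.45, -70.59)}

cache_path = os.path.join(tempfile.mkdtemp(), 'bars_demo.pkl')
first, first_cached = load_or_compute(cache_path, fake_fetch)
second, second_cached = load_or_compute(cache_path, fake_fetch)
print(first_cached, second_cached, len(calls))
```

In the notebook, `compute` could be a small function wrapping `get_bars(latitudes, longitudes)`; whether to cache at all depends on how fresh the venue data needs to be.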




In [15]:
import numpy as np

print('Total number of Bars:', len(bars))
print('Total number of Beer Bars:', len(beer_bars))
print('Percentage of Beer Bars: {:.2f}%'.format(len(beer_bars) / len(bars) * 100))
print('Average number of Bars in neighborhood:', np.array([len(r) for r in location_bars]).mean())

Total number of Bars: 495
Total number of Beer Bars: 91
Percentage of Beer Bars: 18.38%
Average number of Bars in neighborhood: 1.1675824175824177


In [16]:
print('List of all Bars')
print('-----------------------')

for r in list(bars.values())[:10]:
    print(r)
    
print('...')
print('Total:', len(bars))

List of all Bars
-----------------------
('4e6bedecb0fba3f50e188710', "Teppy's Shop", -33.50803949782967, -70.61381083290019, 'Av. Departamental 10, San Joaquín, Metropolitana de Santiago de Chile', 81, True, 350106.63217350846, -3708776.1086736857)
('50ecd7fae4b08979851c9ca8', 'Buba', -33.50824463127704, -70.61420586044085, 'Departamental (Al Ladito De Vicuña), Al Ladi Del Teppys', 101, False, 350070.29142956354, -3708799.4260001965)
('4fa4a219e4b00cad88a8af56', 'Botilleria Las Brisas', -33.50494928849517, -70.58105149647132, 'San Luis de Macul 5043 (Las Brisas), Peñalolén, Metropolitana de Santiago de Chile', 315, True, 353144.45850973, -3708386.599133439)
('4de6f6ecb0fb9a99f6f7cac2', 'E.Vivare', -33.50706714060497, -70.57507753372192, 'Los Cerezos 5938 (San Luis de Macul), Santiago de Chile, Metropolitana de Santiago de Chile', 88, False, 353702.9682311857, -3708612.9991319086)
('4dddd78045dd033c3922835b', 'Restobar Habla De Mi', -33.506164876716966, -70.57703703693551, 'avenida los

In [17]:
print('List of Beer Bars')
print('---------------------------')

for r in list(beer_bars.values())[:10]:
    print(r)
    
print('...')
print('Total:', len(beer_bars))

List of Beer Bars
---------------------------
('4e6bedecb0fba3f50e188710', "Teppy's Shop", -33.50803949782967, -70.61381083290019, 'Av. Departamental 10, San Joaquín, Metropolitana de Santiago de Chile', 81, True, 350106.63217350846, -3708776.1086736857)
('4fa4a219e4b00cad88a8af56', 'Botilleria Las Brisas', -33.50494928849517, -70.58105149647132, 'San Luis de Macul 5043 (Las Brisas), Peñalolén, Metropolitana de Santiago de Chile', 315, True, 353144.45850973, -3708386.599133439)
('5fb0875369730d500109d9a3', 'La Cerveceria', -33.501545, -70.59492, 'Macul, Metropolitana de Santiago de Chile', 308, True, 351850.3541219447, -3708028.8263742877)
('50ce2a45e4b0bd90a67473e9', 'el bigote', -33.49896342006, -70.62829884122485, 'Ureta Cox, Santiago de Chile, Metropolitana de Santiago de Chile', 189, True, 348745.02851021115, -3707790.718281669)
('50fadf3de4b07787cf51577f', 'botilleria Don Mario', -33.49979019165039, -70.62804412841797, 'Avenida Las Industias, Santiago de Chile, Metropolitana de S

In [18]:
print('Bars around location')
print('---------------------------')

for i in range(100, 110):
    rs = location_bars[i][:8]
    names = ', '.join([r[1] for r in rs])
    print('Bars around location {}: {}'.format(i+1, names))

Bars around location
---------------------------
Bars around location 101: 
Bars around location 102: 
Bars around location 103: Rodrigo de Araya 1263, El cocodrilo botilleria, Donde Nano
Bars around location 104: Shoperia Sobarzo, Donde Manolito
Bars around location 105: Shopería Biereck
Bars around location 106: 
Bars around location 107: coyote's place
Bars around location 108: 
Bars around location 109: 
Bars around location 110: k-zador


Let's now see all the collected Bars in our area of interest on a map, and let's also show Beer Bars in a different color.

In [19]:
map_santiago = folium.Map(location=p_nunoa, zoom_start=13)
folium.Marker(p_nunoa, popup='Plaza Ñuñoa').add_to(map_santiago)

for res in bars.values():
    lat = res[2]; lon = res[3]
    is_beer = res[6]
    color = 'red' if is_beer else 'blue'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_santiago)
    
map_santiago

Looking good. So now we have all the Bars in the area within a few kilometers of Plaza Ñuñoa, and we know which ones are Beer Bars!
We also know exactly which Bars are in the vicinity of every neighborhood candidate center.

This concludes the data gathering phase - we're now ready to use this data for analysis to produce the report on optimal locations for a new Beer Bar!

## Methodology <a name="methodology"></a>

In this project we will direct our efforts at detecting areas of Santiago that have low Bar density, particularly those with a low number of Beer Bars. We will limit our analysis to the area within ~6km of Plaza Ñuñoa.

In the first step we have collected the required **data: location and type (category) of every Bar within 6km of the desired center** (Plaza Ñuñoa). We have also **identified Beer Bars** (according to Foursquare categorization).

The second step in our analysis will be the calculation and exploration of '**Bar density**' across different areas of Santiago - we will use **heatmaps** to identify a few promising areas close to Plaza Ñuñoa with a low number of Bars in general (*and* no Beer Bars in the vicinity) and focus our attention on those areas.

In the third and final step we will focus on the most promising areas and within those create **clusters of locations that meet some basic requirements** established in discussion with stakeholders: we will take into consideration locations with **no more than two Bars within a radius of 250 meters**, and we want locations **without Beer Bars within a radius of 400 meters**. We will present a map of all such locations but also create clusters (using **k-means clustering**) of those locations to identify general zones / neighborhoods / addresses which should be a starting point for the final 'street level' exploration and search for the optimal venue location by stakeholders.
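
The two stakeholder requirements can be expressed as a simple filter over candidate locations. Here is a sketch under stated assumptions: the function name `meets_requirements`, the treatment of the exact boundary values (2 Bars and 400 m both pass) and the sample candidates are all made up for illustration.

```python
def meets_requirements(bars_nearby, beer_bar_distance_m,
                       max_bars=2, min_beer_distance_m=400):
    """True if a candidate has few Bars nearby and no Beer Bar too close."""
    few_bars = bars_nearby <= max_bars
    no_beer_bar_close = beer_bar_distance_m >= min_beer_distance_m
    return few_bars and no_beer_bar_close

# Made-up candidates: (label, Bars within 250 m, distance to nearest Beer Bar in m)
candidates = [
    ('A', 0, 820.0),
    ('B', 2, 400.0),
    ('C', 3, 900.0),  # too many Bars nearby
    ('D', 1, 150.0),  # a Beer Bar is too close
]

good = [label for label, n_bars, dist in candidates
        if meets_requirements(n_bars, dist)]
print(good)
```

Keeping the thresholds as parameters makes it easy to see how sensitive the set of good locations is to the 250 m / 400 m choices.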

## Analysis <a name="analysis"></a>

Let's perform some basic exploratory data analysis and derive some additional info from our raw data. First let's count the **number of Bars in every area candidate**:

In [20]:
location_bars_count = [len(bar) for bar in location_bars]
df_locations['Bars in area'] = location_bars_count

print('Average number of Bars in every area with radius=300m:', np.array(location_bars_count).mean())

df_locations.head(10)

Average number of Bars in every area with radius=300m: 1.1675824175824177


| | Address | Latitude | Longitude | X | Y | Distance from center | Bars in area |
|---|---|---|---|---|---|---|---|
| 0 | Vicuña Mackenna 5537, San Joaquín, Macul | -33.507336 | -70.614073 | 350081.083034 | -3708699.0 | 5992.495307 | 2 |
| 1 | Unnamed Road, Macul | -33.50742 | -70.607615 | 350681.083034 | -3708699.0 | 5840.3767 | 0 |
| 2 | Froilán Roa 5900, Macul | -33.507504 | -70.601158 | 351281.083034 | -3708699.0 | 5747.173218 | 0 |
| 3 | Av. Departamental 4400, Macul | -33.507587 | -70.5947 | 351881.083034 | -3708699.0 | 5715.767665 | 0 |
| 4 | Villa Gildemeister 4372, Penalolen, Peñalolén | -33.50767 | -70.588242 | 352481.083034 | -3708699.0 | 5747.173218 | 0 |
| 5 | Chillán 6092, Penalolen, Peñalolén | -33.507753 | -70.581785 | 353081.083034 | -3708699.0 | 5840.3767 | 0 |
| 6 | Los Cerezos 6046, Penalolen, Peñalolén | -33.507835 | -70.575327 | 353681.083034 | -3708699.0 | 5992.495307 | 2 |
| 7 | Piramide 237, San Joaquín | -33.502525 | -70.623672 | 349181.083034 | -3708179.0 | 5855.766389 | 0 |
| 8 | Emco 5080, San Joaquín | -33.502609 | -70.617214 | 349781.083034 | -3708179.0 | 5604.462508 | 0 |
| 9 | José Caroca 2087, Macul | -33.502694 | -70.610757 | 350381.083034 | -3708179.0 | 5408.326913 | 1 |


OK, now let's calculate the **distance to nearest Beer Bar from every area candidate center** (not only those within 300m - we want distance to closest one, regardless of how distant it is).

In [21]:
distances_to_beer_bar = []

for area_x, area_y in zip(xs, ys):
    min_distance = float('inf')  # no fixed cap, so the true nearest Beer Bar is always found
    
    for res in beer_bars.values():
        res_x = res[7]
        res_y = res[8]
        d = calc_xy_distance(area_x, area_y, res_x, res_y)
        
        if d<min_distance:
            min_distance = d
            
    distances_to_beer_bar.append(min_distance)

df_locations['Distance to Beer Bar'] = distances_to_beer_bar
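
The nearest-neighbor search above can also be written more compactly with `min` and `math.hypot`. Here is a sketch on made-up projected coordinates; the function name `nearest_distance` and the sample points are assumptions for illustration, not values from the dataset.

```python
import math

def nearest_distance(x, y, points, default=float('inf')):
    """Distance from (x, y) to the closest of `points`, or `default` if empty."""
    return min((math.hypot(px - x, py - y) for px, py in points),
               default=default)

# Made-up projected coordinates in meters
beer_xy = [(300.0, 400.0), (1000.0, 0.0)]

print(nearest_distance(0.0, 0.0, beer_xy))  # closest point is the 3-4-5 triangle one
print(nearest_distance(0.0, 0.0, []))       # falls back to the default
```

With a few hundred candidates and under a hundred Beer Bars this brute-force scan is cheap; a spatial index would only start to matter for much larger inputs.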

In [22]:
df_locations.head(10)

| | Address | Latitude | Longitude | X | Y | Distance from center | Bars in area | Distance to Beer Bar |
|---|---|---|---|---|---|---|---|---|
| 0 | Vicuña Mackenna 5537, San Joaquín, Macul | -33.507336 | -70.614073 | 350081.083034 | -3708699.0 | 5992.495307 | 2 | 81.68101 |
| 1 | Unnamed Road, Macul | -33.50742 | -70.607615 | 350681.083034 | -3708699.0 | 5840.3767 | 0 | 579.666128 |
| 2 | Froilán Roa 5900, Macul | -33.507504 | -70.601158 | 351281.083034 | -3708699.0 | 5747.173218 | 0 | 878.958205 |
| 3 | Av. Departamental 4400, Macul | -33.507587 | -70.5947 | 351881.083034 | -3708699.0 | 5715.767665 | 0 | 670.40452 |
| 4 | Villa Gildemeister 4372, Penalolen, Peñalolén | -33.50767 | -70.588242 | 352481.083034 | -3708699.0 | 5747.173218 | 0 | 733.052224 |
| 5 | Chillán 6092, Penalolen, Peñalolén | -33.507753 | -70.581785 | 353081.083034 | -3708699.0 | 5840.3767 | 0 | 318.300159 |
| 6 | Los Cerezos 6046, Penalolen, Peñalolén | -33.507835 | -70.575327 | 353681.083034 | -3708699.0 | 5992.495307 | 2 | 620.696721 |
| 7 | Piramide 237, San Joaquín | -33.502525 | -70.623672 | 349181.083034 | -3708179.0 | 5855.766389 | 0 | 506.975224 |
| 8 | Emco 5080, San Joaquín | -33.502609 | -70.617214 | 349781.083034 | -3708179.0 | 5604.462508 | 0 | 680.167088 |
| 9 | José Caroca 2087, Macul | -33.502694 | -70.610757 | 350381.083034 | -3708179.0 | 5408.326913 | 1 | 657.242953 |


In [23]:
print('Average distance to closest Beer Bar from each area center:', df_locations['Distance to Beer Bar'].mean())

Average distance to closest Beer Bar from each area center: 751.3431254925061


OK, so **on average a Beer Bar can be found within ~750m** of every area center candidate. That is a reasonable distance, so we will have some flexibility in choosing the neighborhoods.

Let's create a map showing a **heatmap / density of Bars** and try to extract some meaningful info from it. Also, let's show a few circles on our map indicating distances of 1km, 2km and 3km from Plaza Ñuñoa.

In [24]:
bar_latlons = [[bar[2], bar[3]] for bar in bars.values()]
beer_latlons = [[beer[2], beer[3]] for beer in beer_bars.values()]

In [25]:
from folium import plugins
from folium.plugins import HeatMap

map_santiago = folium.Map(location=p_nunoa, zoom_start=13)
folium.TileLayer('cartodbpositron').add_to(map_santiago) #cartodbpositron cartodbdark_matter
HeatMap(bar_latlons).add_to(map_santiago)
folium.Marker(p_nunoa).add_to(map_santiago)
folium.Circle(p_nunoa, radius=1000, fill=False, color='white').add_to(map_santiago)
folium.Circle(p_nunoa, radius=2000, fill=False, color='white').add_to(map_santiago)
folium.Circle(p_nunoa, radius=3000, fill=False, color='white').add_to(map_santiago)

map_santiago

There are some locations with very low Bar density (or no information at all). Most of these are mainly residential areas, so we should avoid them. There are very high density pockets right at Plaza Ñuñoa and to the north-west. Some pockets of low Bar density can be recognized to the **north-west and east of Plaza Ñuñoa**.

Let's create another map showing the **heatmap/density of Beer Bars** only.

In [26]:
map_santiago = folium.Map(location=p_nunoa, zoom_start=13)
folium.TileLayer('cartodbpositron').add_to(map_santiago) #cartodbpositron cartodbdark_matter
HeatMap(beer_latlons).add_to(map_santiago)
folium.Marker(p_nunoa).add_to(map_santiago)
folium.Circle(p_nunoa, radius=1000, fill=False, color='white').add_to(map_santiago)
folium.Circle(p_nunoa, radius=2000, fill=False, color='white').add_to(map_santiago)
folium.Circle(p_nunoa, radius=3000, fill=False, color='white').add_to(map_santiago)

map_santiago

This map is not so 'hot' (Beer Bars are a subset of ~18% of all the Bars we collected) but it also indicates higher density of existing Beer Bars directly at and west of Plaza Ñuñoa, with the closest pockets of **low Beer Bar density positioned north-west and east of the desired center**.

Based on this we will now focus our analysis on areas *north-west of Plaza Ñuñoa* - we will move the center of our area of interest and reduce its size to a radius of **2.5km**.
This places our location candidates mostly in the borough of **Providencia**.

### Providencia

Popular travel guides and web sites often mention Providencia as a beautiful, interesting Santiago neighborhood. It is also a borough with a higher income than other city areas, so it could be more attractive for the stakeholders.

It is also relatively close to Plaza Ñuñoa and well connected, so it appears to justify further analysis.

Let's define a new, narrower region of interest, which will include the low-bar-count parts of Providencia closest to Plaza Ñuñoa.

In [32]:
roi_x_min = p_nunoa_x - 3000
roi_y_max = p_nunoa_y + 5000
roi_width = 5000
roi_height = 5000
roi_center_x = roi_x_min + 2500
roi_center_y = roi_y_max - 2500
roi_center_lon, roi_center_lat = xy_to_lonlat(roi_center_x, roi_center_y)
roi_center = [roi_center_lat, roi_center_lon]

map_santiago = folium.Map(location=roi_center, zoom_start=14)
HeatMap(bar_latlons).add_to(map_santiago)
folium.Marker(p_nunoa).add_to(map_santiago)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(map_santiago)

map_santiago



Not bad - this nicely covers the pockets of low Bar density in Providencia closest to Plaza Ñuñoa. We will also create a denser grid of location candidates restricted to the new region of interest.
Our location candidates will be 100m apart.
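
As a side check on the region just defined, the ROI center can be expressed as an offset from Plaza Ñuñoa. The sketch below repeats the arithmetic from the ROI cell with a placeholder origin of (0, 0) standing in for the plaza's projected coordinates (an assumption made only to keep the example self-contained). It suggests the center lies about 500 m west and 2,500 m north of the plaza, roughly 2.55 km away, so the plaza itself sits close to the edge of the 2.5 km circle.

```python
import math

# Placeholder projected coordinates for Plaza Ñuñoa (assumption: in the
# notebook the real values come from the lon/lat to x/y conversion).
p_x, p_y = 0.0, 0.0

# Same offsets as in the ROI cell above.
roi_x_min = p_x - 3000
roi_y_max = p_y + 5000
roi_center_x = roi_x_min + 2500   # 500 m west of the plaza
roi_center_y = roi_y_max - 2500   # 2500 m north of the plaza

# Straight-line distance between the plaza and the ROI center, in meters.
offset = math.hypot(roi_center_x - p_x, roi_center_y - p_y)
print(roi_center_x - p_x, roi_center_y - p_y, round(offset))
```

Because only differences are used, the result does not depend on the placeholder origin.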

In [33]:
k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_step = 100
y_step = 100 * k 
roi_y_min = roi_center_y - 2500

roi_latitudes = []
roi_longitudes = []
roi_xs = []
roi_ys = []
for i in range(0, int(51/k)):
    y = roi_y_min + i * y_step
    x_offset = 50 if i%2==0 else 0
    
    for j in range(0, 51):
        x = roi_x_min + j * x_step + x_offset
        d = calc_xy_distance(roi_center_x, roi_center_y, x, y)
        
        if (d <= 2501):
            lon, lat = xy_to_lonlat(x, y)
            roi_latitudes.append(lat)
            roi_longitudes.append(lon)
            roi_xs.append(x)
            roi_ys.append(y)

print(len(roi_latitudes), 'candidate neighborhood centers generated.')



2261 candidate neighborhood centers generated.
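
The factor `k = sqrt(3) / 2` is what turns the offset rows into a hexagonal lattice: with rows spaced `100 * k` meters apart and every other row shifted by 50 m, each interior point should have six nearest neighbors, all 100 m away. The following is a minimal sketch that checks this on a small 5x5 grid; `hex_grid` is a hypothetical helper written for this illustration, not a function from the notebook.

```python
import math

def hex_grid(x_min, y_min, n_rows, n_cols, spacing=100.0):
    """Generate points on a hexagonal lattice with the given spacing."""
    k = math.sqrt(3) / 2          # row height as a fraction of the spacing
    points = []
    for i in range(n_rows):
        y = y_min + i * spacing * k
        # Shift every other row by half a step, as in the cell above.
        x_offset = spacing / 2 if i % 2 == 0 else 0.0
        for j in range(n_cols):
            points.append((x_min + j * spacing + x_offset, y))
    return points

points = hex_grid(0.0, 0.0, n_rows=5, n_cols=5)

# Distances from an interior point (row 2, column 2) to all other points.
center = points[2 * 5 + 2]
dists = sorted(math.dist(center, p) for p in points if p != center)

# The six smallest distances are the hexagonal neighbors.
print([round(d, 6) for d in dists[:6]])
```

With a square grid (rows 100 m apart, no offset) the diagonal neighbors would be about 141 m away instead, so coverage would be less uniform.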


OK. Now let's calculate the two most important things for each location candidate: the **number of Bars in the vicinity** (we'll use a radius of **250 meters**) and the **distance to the closest Beer Bar**.

In [34]:
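def count_bars_nearby(x, y, bars, radius=250):
    """Count the bars within `radius` meters of the point (x, y)."""
    count = 0
    
    for bar in bars.values():
        # Projected x/y coordinates are read from positions 7 and 8 of each bar record
        bar_x = bar[7]; bar_y = bar[8]
        d = calc_xy_distance(x, y, bar_x, bar_y)
        if d<=radius:
            count += 1
            
    return count

def find_nearest_bar(x, y, bars):
    """Return the distance in meters from (x, y) to the closest bar in `bars`."""
    d_min = float('inf')  # stays infinite if `bars` is empty
    
    for bar in bars.values():
        bar_x = bar[7]; bar_y = bar[8]
        d = calc_xy_distance(x, y, bar_x, bar_y)
        
        if d<=d_min:
            d_min = d
            
    return d_min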

roi_bar_counts = []
roi_beer_distances = []

print('Generating data on location candidates... ', end='')

for x, y in zip(roi_xs, roi_ys):
    
    count = count_bars_nearby(x, y, bars, radius=250)
    roi_bar_counts.append(count)
    distance = find_nearest_bar(x, y, beer_bars)
    roi_beer_distances.append(distance)
print('done.')


Generating data on location candidates... done.
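
To make the two measures concrete, here is a minimal sketch that computes both for a single candidate point. It assumes bars are given as plain `(x, y)` tuples in projected meters rather than the notebook's bar records; the coordinates are made up for illustration and `candidate_stats` is a hypothetical helper.

```python
import math

def candidate_stats(x, y, bars_xy, beer_xy, radius=250.0):
    """Return (number of bars within `radius`, distance to nearest beer bar)."""
    count = sum(1 for bx, by in bars_xy
                if math.hypot(bx - x, by - y) <= radius)
    # An empty list of beer bars yields an infinite distance.
    nearest = min((math.hypot(bx - x, by - y) for bx, by in beer_xy),
                  default=math.inf)
    return count, nearest

# Made-up projected coordinates in meters, for illustration only.
bars_xy = [(0, 100), (200, 0), (300, 400), (1000, 1000)]
beer_xy = [(300, 400), (1000, 1000)]   # beer bars are a subset of all bars

count, nearest = candidate_stats(0, 0, bars_xy, beer_xy)
print(count, nearest)
```

For the candidate at the origin, two bars fall inside the 250 m radius and the closest beer bar is 500 m away, so this point would pass both of the filters applied next.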


In [35]:
# Let's put this into dataframe
df_roi_locations = pd.DataFrame({'Latitude':roi_latitudes,
                                 'Longitude':roi_longitudes,
                                 'X':roi_xs,
                                 'Y':roi_ys,
                                 'Bars nearby':roi_bar_counts,
                                 'Distance to Beer Bar':roi_beer_distances})

df_roi_locations.head(10)

| | Latitude | Longitude | X | Y | Bars nearby | Distance to Beer Bar |
|---|---|---|---|---|---|---|
| 0 | -33.455977 | -70.599672 | 351331.083034 | -3702983.0 | 6 | 193.194819 |
| 1 | -33.455991 | -70.598596 | 351431.083034 | -3702983.0 | 6 | 155.941749 |
| 2 | -33.455119 | -70.605573 | 350781.083034 | -3702896.0 | 1 | 532.257496 |
| 3 | -33.455133 | -70.604498 | 350881.083034 | -3702896.0 | 1 | 447.655231 |
| 4 | -33.455147 | -70.603422 | 350981.083034 | -3702896.0 | 0 | 370.799635 |
| 5 | -33.455161 | -70.602346 | 351081.083034 | -3702896.0 | 0 | 307.554115 |
| 6 | -33.455175 | -70.601271 | 351181.083034 | -3702896.0 | 1 | 267.743716 |
| 7 | -33.455189 | -70.600195 | 351281.083034 | -3702896.0 | 6 | 215.340313 |
| 8 | -33.455203 | -70.59912 | 351381.083034 | -3702896.0 | 7 | 122.14789 |
| 9 | -33.455217 | -70.598044 | 351481.083034 | -3702896.0 | 9 | 58.896209 |


OK. Let us now **filter** those locations: we're interested only in **locations with no more than two Bars within a radius of 250 meters**, and **no Beer bars within a radius of 400 meters**.

In [36]:
good_bar_count = np.array((df_roi_locations['Bars nearby']<=2))
print('Locations with no more than two Bars nearby:', good_bar_count.sum())

good_beer_distance = np.array(df_roi_locations['Distance to Beer Bar']>=400)
print('Locations with no Beer Bars within 400m:', good_beer_distance.sum())

good_locations = np.logical_and(good_bar_count, good_beer_distance)
print('Locations with both conditions met:', good_locations.sum())

df_good_locations = df_roi_locations[good_locations]


Locations with no more than two Bars nearby: 1895
Locations with no Beer Bars within 400m: 1738
Locations with both conditions met: 1557
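
The three counts above also tell us how many candidates fail both tests. A short worked check using inclusion-exclusion on the printed numbers (nothing here is recomputed from the underlying data):

```python
total = 2261          # candidate centers generated
few_bars = 1895       # no more than two Bars within 250 m
no_beer = 1738        # no Beer Bar within 400 m
both = 1557           # both conditions met

# Inclusion-exclusion: candidates meeting at least one condition.
at_least_one = few_bars + no_beer - both
neither = total - at_least_one
share_good = both / total

print(at_least_one, neither, f'{share_good:.1%}')
```

So 185 candidates meet neither condition, and roughly 69% of the grid survives the combined filter.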


Let's see how this looks on a map.

In [37]:
good_latitudes = df_good_locations['Latitude'].values
good_longitudes = df_good_locations['Longitude'].values

good_locations = [[lat, lon] for lat, lon in zip(good_latitudes, good_longitudes)]

map_santiago = folium.Map(location=roi_center, zoom_start=14)
folium.TileLayer('cartodbpositron').add_to(map_santiago)
HeatMap(bar_latlons).add_to(map_santiago)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.6).add_to(map_santiago) # region of interest (2.5 km radius)
folium.Marker(p_nunoa).add_to(map_santiago)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_santiago) 

map_santiago

Looking good. We now have a bunch of locations fairly close to Plaza Ñuñoa (mostly in the borough of Providencia), and we know that each of those locations has no more than two bars within a radius of 250m, and no Beer bar closer than 400m. Any of those locations is a potential candidate for a new Beer bar, at least based on nearby competition.

Let's now show those good locations in the form of a heatmap:

In [38]:
map_santiago = folium.Map(location=roi_center, zoom_start=14)
HeatMap(good_locations, radius=25).add_to(map_santiago)
folium.Marker(p_nunoa).add_to(map_santiago)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_santiago)

map_santiago

Looking good. What we have now is a clear indication of zones with a low number of Bars in the vicinity and *no* Beer bars within 400m.

Let us now **cluster** those locations to create **centers of zones containing good locations**. Those zones, their centers and addresses will be the final result of our analysis. 

In [39]:
from sklearn.cluster import KMeans

number_of_clusters = 15

good_xys = df_good_locations[['X', 'Y']].values
kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(good_xys)

cluster_centers = [xy_to_lonlat(cc[0], cc[1]) for cc in kmeans.cluster_centers_]

map_santiago = folium.Map(location=roi_center, zoom_start=15)
folium.TileLayer('cartodbpositron').add_to(map_santiago)
HeatMap(bar_latlons).add_to(map_santiago)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(map_santiago)
folium.Marker(p_nunoa).add_to(map_santiago)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='green', fill=True, fill_opacity=0.25).add_to(map_santiago) 
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_santiago)

map_santiago



Not bad - our clusters represent groupings of most of the candidate locations and cluster centers are placed nicely in the middle of the zones 'rich' with location candidates.
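
For readers unfamiliar with k-means, the sketch below shows the basic idea on a handful of made-up points: alternate between assigning each point to its nearest center and moving each center to the mean of its points. This is plain Lloyd's algorithm with hand-picked starting centers (an assumption for reproducibility), not scikit-learn's implementation, which chooses its initial centers differently.

```python
import math

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]

def kmeans(points, centers, n_iter=20):
    """Plain Lloyd's algorithm: assign points, then recompute the centers."""
    centers = list(centers)
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for c, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centers.append(old)   # keep the center of an empty cluster
        if new_centers == centers:        # converged: centers stopped moving
            break
        centers = new_centers
    return centers, assign(points, centers)

# Two made-up groups of candidate locations (projected meters).
points = [(0, 0), (100, 0), (0, 100), (1000, 1000), (1100, 1000), (1000, 1100)]
centers, labels = kmeans(points, [(0, 0), (1000, 1000)])
print(labels)
print(centers)
```

Each returned center is the centroid of its group, which is why the cluster centers on the map sit in the middle of dense patches of candidates even when the patch itself is irregular.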

Addresses of those cluster centers will be a good starting point for exploring the neighborhoods to find the best possible location based on neighborhood specifics.

Let's see those zones on a city map without heatmap, using shaded areas to indicate our clusters:

In [41]:
map_santiago = folium.Map(location=roi_center, zoom_start=14)

folium.Marker(p_nunoa).add_to(map_santiago)

for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#00000000', fill=True, fill_color='#0066ff', fill_opacity=0.07).add_to(map_santiago)
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_santiago)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='green', fill=False).add_to(map_santiago) 

map_santiago

Finally, let's **reverse geocode those candidate area centers to get the addresses** which can be presented to stakeholders.

In [42]:
candidate_area_addresses = []
print('==============================================================')
print('Addresses of centers of areas recommended for further analysis')
print('==============================================================\n')
for lon, lat in cluster_centers:
    addr = get_address(google_api_key, lat, lon).replace(', Chile', '')
    addr = addr.replace(', Región Metropolitana', '')

    candidate_area_addresses.append(addr)    
    x, y = lonlat_to_xy(lon, lat)
    d = calc_xy_distance(x, y, p_nunoa_x, p_nunoa_y)
    print('{}{} => {:.1f}km from Plaza Ñuñoa'.format(addr, ' '*(50-len(addr)), d/1000))
    

Addresses of centers of areas recommended for further analysis

Av. Manuel Montt 2051, Providencia                 => 2.2km from Plaza Ñuñoa
Holanda 1709, Providencia                          => 2.5km from Plaza Ñuñoa
Av. Américo Vespucio Sur 919, Las Condes           => 4.0km from Plaza Ñuñoa
Rossemblut 4847, Ñuñoa                             => 1.9km from Plaza Ñuñoa
Av. Andrés Bello 2600, Providencia, Las Condes     => 4.7km from Plaza Ñuñoa
Jorge Matte Gormaz 2436, Providencia               => 2.0km from Plaza Ñuñoa
Simón Bolívar 2393, Ñuñoa                          => 1.3km from Plaza Ñuñoa
Holanda 902, Providencia                           => 3.3km from Plaza Ñuñoa
Padre Mariano 239, Providencia                     => 4.1km from Plaza Ñuñoa
Pedro Torres 758, Ñuñoa                            => 1.0km from Plaza Ñuñoa
Tarragona 3622, Las Condes                         => 3.1km from Plaza Ñuñoa
Mariano Sánchez Fontecilla 5512, La Reina          => 2.7km from Plaza Ñuñoa
Av. Los Leones 2595, Providencia                   => 1.6km from Plaza Ñuñoa
Gertrudis Echenique 96, Las Condes                 => 4.3km from Plaza Ñuñoa
Román Díaz 938, Providencia                        => 3.1km from Plaza Ñuñoa

This concludes our analysis. We have produced 15 addresses representing centers of zones containing locations with a low number of Bars and no Beer Bars nearby, all zones being fairly close to Plaza Ñuñoa (all less than 5km from this location). Although the zones are shown on the map with a radius of ~500 meters (green circles), their shape is actually very irregular, and their centers/addresses should be considered only as a starting point for exploring the surrounding neighborhoods in search of potential Bar locations. Most of the zones are located in the borough of Providencia, which we have identified as interesting because it is popular with tourists, fairly close to Plaza Ñuñoa and well connected by public transport.

In [44]:
map_santiago = folium.Map(location=roi_center, zoom_start=15)
folium.Circle(p_nunoa, radius=50, color='red', fill=True, fill_color='red', fill_opacity=1).add_to(map_santiago)
for lonlat, addr in zip(cluster_centers, candidate_area_addresses):
    folium.Marker([lonlat[1], lonlat[0]], popup=addr).add_to(map_santiago)
    
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.Circle([lat, lon], radius=250, color='#0000ff00', fill=True, fill_color='#0066ff', fill_opacity=0.05).add_to(map_santiago)
    
map_santiago

## Results and Discussion <a name="results"></a>

Our analysis shows that although there is a relatively high number of Bars in Santiago (~500 in our initial area of interest, which was 12x12km around Plaza Ñuñoa), there are significant areas of low Bar density fairly close to Plaza Ñuñoa. The highest concentration of Bars was detected right at and west of Plaza Ñuñoa, so we focused our attention on the areas to the north and north-west, corresponding to the borough of Providencia, which offers a combination of popularity among tourists, closeness to Plaza Ñuñoa, strong socio-economic dynamics *and* significant areas of low Bar density.

After directing our attention to this narrower area of interest (covering approx. 5x5km north-west of Plaza Ñuñoa) we first created a dense grid of location candidates (spaced 100m apart); those locations were then filtered so that those with more than two Bars within a radius of 250m and those with a Beer Bar closer than 400m were removed.

Those location candidates were then clustered to create zones of interest which contain the greatest number of location candidates. Addresses of the centers of those zones were also generated using reverse geocoding, to be used as markers/starting points for more detailed local analysis based on other factors.

The result of all this is 15 zones containing the largest number of potential new Bar locations based on the number of and distance to existing venues - both Bars in general and Beer Bars in particular. This, of course, does not imply that those zones are actually optimal locations for a new Bar! The purpose of this analysis was only to provide information on areas close to Plaza Ñuñoa that are not crowded with existing Bars (particularly Beer ones). It is entirely possible that there is a very good reason for the small number of Bars in any of those areas (for example, some of them may be residential and therefore not suitable for a Bar), which would make them unsuitable for a new Bar regardless of the lack of competition. Recommended zones should therefore be considered only as a starting point for more detailed analysis, which could eventually result in a location that not only has no nearby competition but also meets all other relevant conditions.

## Conclusion <a name="conclusion"></a>

The purpose of this project was to identify Santiago areas close to Plaza Ñuñoa with a low number of Bars (particularly Beer Bars) in order to aid stakeholders in narrowing down the search for an optimal location for a new Beer Bar. By calculating the Bar density distribution from Foursquare data we first identified the borough that justifies further analysis (Providencia), and then generated an extensive collection of locations which satisfy some basic requirements regarding existing nearby Bars. Those locations were then clustered to create major zones of interest (containing the greatest number of potential locations), and addresses of those zone centers were generated to be used as starting points for final exploration by stakeholders.

The final decision on the optimal Bar location will be made by stakeholders based on specific characteristics of neighborhoods and locations in every recommended zone, taking into consideration additional factors such as the attractiveness of each location (proximity to a park or water), levels of noise / proximity to major roads, real estate availability, prices, and the social and economic dynamics of every neighborhood.