# Capstone Project - The Battle of the Neighborhoods (Week 2)
Applied Data Science Capstone by IBM/Coursera

# Acknowledgment

<p>This project is based on a sample project that this course has introduced (<a href="https://cocl.us/coursera_capstone_notebook">Ref.</a>).</p>

# Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



# Introduction: Business Problem <a name="introduction"></a>

Ho Chi Minh City (Saigon) is the business and financial hub of Vietnam. In 2019 its population was put at 9.0 million people, and the city covers about 2,095 km² across 24 districts. The city is developing and modernizing key sectors, namely trading, import-export, finance and banking, insurance, tourism, telecommunications, science and technology, and services for trading and production in HCM City and the southern provinces. Today, Ho Chi Minh City is a popular tourist destination thanks to its warm weather, fascinating culture, sleek skyscrapers, and ornate temples and pagodas. The city is also filled with bars, coffee shops, and restaurants that overlook Saigon and beyond, many of them offering local Vietnamese cuisine. The city contributes the largest share of the national budget and has been dubbed the most livable city in Vietnam.

In this project, we will try to find an **optimal location** for a **coffee shop**. Specifically, this report will be targeted to stakeholders interested in opening a coffee shop in the center of Ho Chi Minh City. This project will address the following 3 issues:

First, we segment and cluster neighborhoods in Ho Chi Minh City and define a central location, placed between Tan Son Nhat International Airport and the city center (District 1). Visitors arriving in Vietnam typically travel along this route, so it is where they are likely to look for a cafe. We then use the Foursquare API to explore the neighborhoods around it.

Second, we look for reasonable areas in which to open a coffee shop. Because Ho Chi Minh City already has many coffee shops, we try to detect locations that are not already crowded with them. We are particularly interested in areas with no cafe in the vicinity, and we prefer locations as close to the city center as possible.

Third, after identifying a reasonable area, we analyze the data to find specific locations where a coffee shop could be opened most economically.

# Data Description <a name="data"></a>

Based on the problem definition above, the following factors will influence our decision:

- the number of existing coffee shops in the area
- the number of coffee shops in the area and the distances between them
- the distance of the area from the central location (located between Tan Son Nhat International Airport and the city center).
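
The three factors above can be computed for a single candidate area once everything is expressed in planar coordinates (meters). The following is a minimal sketch under that assumption; `area_features` and the sample coordinates are hypothetical and are not part of the notebook.

```python
import math

def area_features(area_xy, center_xy, shops_xy, radius=300.0):
    """Return (shops within radius, distance to nearest shop, distance from center)."""
    ax, ay = area_xy
    dists = [math.hypot(sx - ax, sy - ay) for sx, sy in shops_xy]
    n_close = sum(1 for d in dists if d <= radius)          # factor 1: shops in the area
    nearest = min(dists) if dists else math.inf             # factor 2: distance to nearest shop
    from_center = math.hypot(ax - center_xy[0], ay - center_xy[1])  # factor 3
    return n_close, nearest, from_center

# Hypothetical planar coordinates in meters (not real data)
center = (0.0, 0.0)
shops = [(100.0, 0.0), (900.0, 400.0), (-250.0, 150.0)]
features = area_features((0.0, 300.0), center, shops)
print(features)
```

The 300 m default mirrors the per-area cut-off used later in the notebook; any other radius can be passed in.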

We will create a grid of cells covering our area of interest, which is a circle with a radius of approx. 12 kilometers around the central location defined above.

We use a regularly spaced grid of locations, centered on the central location, to define the areas.

The following data sources will be needed to extract or generate the required information:

- centers of candidate areas will be generated algorithmically, and approximate addresses of those centers will be obtained using Google Maps API reverse geocoding

- the number of coffee shops and their type and location in every area will be obtained using the Foursquare API

- the coordinates of the central location will be obtained using Google Maps API geocoding.

# Area Candidates

Let's create latitude & longitude coordinates for the centroids of our candidate areas. We will create a grid of cells covering our area of interest, which has a radius of approx. 12 kilometers centered around the central location.

Let's first find the latitude & longitude of the central location, using two specific, well-known addresses and the Google Maps geocoding API. The central location is the midpoint between them.

In [1]:
import requests

def get_coordinates(api_key, address, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&address={}'.format(api_key, address)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        geographical_data = results[0]['geometry']['location'] # get geographical coordinates
        lat = geographical_data['lat']
        lon = geographical_data['lng']
        return [lat, lon]
    except:
        return [None, None]
    
address = 'Ủy ban nhân dân Tp.HCM, Lê Thánh Tôn, Phường Bến Nghé, Quận 1, Hồ Chí Minh, Việt Nam'
address2 = 'Sân bay quốc tế Tân Sơn Nhất, Trường Sơn, Phường 2, Quận Tân Bình, Hồ Chí Minh, Việt Nam'
google_api_key = 'AIzaSyAQWqMTOcyLBRDR2skO4F_5QEWzNDOlUHw'

hcmc_center1 = get_coordinates(google_api_key, address)
hcmc_center2 = get_coordinates(google_api_key, address2)

hcmc_center = []
hcmc_center.append((hcmc_center1[0] + hcmc_center2[0])/2)
hcmc_center.append((hcmc_center1[1] + hcmc_center2[1])/2)

print('Coordinate of {}: {}'.format('The central location of Ho Chi Minh City', hcmc_center))

Coordinate of The central location of Ho Chi Minh City: [10.79751205, 106.67989825]
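
The addresses passed to `get_coordinates` contain spaces, commas, and Vietnamese diacritics, and they are inserted into the URL with plain string formatting. Below is a minimal sketch of building the same query with explicit percent-encoding, using only the standard library. `build_geocode_url` and the `'YOUR_API_KEY'` placeholder are illustrative names, not part of the original notebook.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

GEOCODE_ENDPOINT = 'https://maps.googleapis.com/maps/api/geocode/json'

def build_geocode_url(api_key, address):
    """Build a geocoding request URL with percent-encoded query parameters."""
    query = urlencode({'key': api_key, 'address': address})
    return '{}?{}'.format(GEOCODE_ENDPOINT, query)

address = 'Lê Thánh Tôn, Quận 1, Hồ Chí Minh'
url = build_geocode_url('YOUR_API_KEY', address)
print(url)

# Decoding the query string recovers the original address unchanged
decoded = parse_qs(urlsplit(url).query)['address'][0]
print(decoded)
```

With `requests`, passing the same dictionary through the `params=` argument of `requests.get` should have an equivalent effect.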


In [2]:
!pip install shapely
!pip install pyproj



In [3]:
import shapely.geometry
import pyproj
import math

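def lonlat_to_xy(lon, lat):
    # Project WGS84 longitude/latitude (degrees) to UTM x/y (meters).
    # NOTE: zone=33 does not cover Ho Chi Minh City (longitude ~106.7°E falls in UTM zone 48),
    # so x/y values and distances are distorted. It is kept here because every output
    # in this notebook was produced with it.
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    # Inverse of lonlat_to_xy; uses the same UTM zone.
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]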

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')

print('Central location longitude = {}, latitude = {}'.format(hcmc_center[1], hcmc_center[0]))
x, y = lonlat_to_xy(hcmc_center[1], hcmc_center[0])

print('Central location UTM X = {}, Y = {}'.format(x, y))
lo, la = xy_to_lonlat(x, y)

print('Central location longitude = {}, latitude = {}'.format(lo, la))

Coordinate transformation check
-------------------------------
Central location longitude = 106.67989825, latitude = 10.79751205
Central location UTM X = 15220787.389165862, Y = 10899793.552625168
Central location longitude = 106.67993419927984, latitude = 10.797525700662865
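
The round trip above does not return exactly the starting coordinates, and the projection helpers use `zone=33`. As a cross-check, here is a minimal sketch of the standard rule for picking a UTM zone from a longitude (6-degree zones, ignoring the special cases around Norway and Svalbard). `utm_zone` and `utm_central_meridian` are hypothetical helper names, not pyproj functions.

```python
import math

def utm_zone(lon):
    """Standard 6-degree UTM zone number (1..60) for a longitude in degrees."""
    return int(math.floor((lon + 180.0) / 6.0)) % 60 + 1

def utm_central_meridian(zone):
    """Longitude (degrees) of the central meridian of a UTM zone."""
    return zone * 6 - 183

hcmc_lon = 106.67989825  # central location longitude printed above
zone = utm_zone(hcmc_lon)
print(zone, utm_central_meridian(zone))
```

For this longitude the rule gives zone 48, whose central meridian is 105°E. Zone 33 is centered on 15°E, which may partly explain the drift in the round trip. All X/Y values in this notebook were nevertheless produced with zone 33.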


Let's create a **hexagonal grid of cells**: we offset every other row, and adjust vertical row spacing so that **every cell center is equally distant from all its neighbors**.
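
To make the construction concrete, here is a minimal standalone sketch of a hexagonal grid in planar meters, using only the standard library. It applies the alternating row offset described above. Note that in cell 6 below that alternation is commented out (`x_offset = 300 #if i%2==0 else 0`), so the notebook's own grid shifts every row by the same amount. `hex_grid` is a hypothetical helper, not the notebook's code.

```python
import math

def hex_grid(center_x, center_y, radius, step):
    """Centers of a hexagonal grid (planar meters) within `radius` of the center."""
    row_step = step * math.sqrt(3) / 2          # vertical spacing between rows
    n = int(radius // row_step) + 1             # rows/columns needed on each side
    points = []
    for i in range(-n, n + 1):
        y = center_y + i * row_step
        x_offset = step / 2 if i % 2 else 0.0   # shift every other row by half a step
        for j in range(-n, n + 1):
            x = center_x + j * step + x_offset
            if math.hypot(x - center_x, y - center_y) <= radius:
                points.append((x, y))
    return points

points = hex_grid(0.0, 0.0, radius=3000.0, step=600.0)

# Distances from the center point to every other point, nearest first
dists = sorted(math.hypot(x, y) for x, y in points if (x, y) != (0.0, 0.0))
neighbor_d = dists[:6]
print(len(points), [round(d, 6) for d in neighbor_d])
```

With the alternating offset, the six nearest neighbors of an interior point all lie one `step` away, which is the "equally distant" property the text refers to.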

In [4]:
hcmc_center_x, hcmc_center_y = lonlat_to_xy(
    hcmc_center[1], hcmc_center[0])  # City center in Cartesian coordinates

(hcmc_center_x, hcmc_center_y)

(15220787.389165862, 10899793.552625168)

In [5]:
k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_min = hcmc_center_x - 12000 #6000
x_step = 600
y_min = hcmc_center_y - 12000 #- (int(21/k)*k*600 - 12000)/2
y_step = 600 * k

(k, x_min, x_step, y_min, y_step)

(0.8660254037844386,
 15208787.389165862,
 600,
 10887793.552625168,
 519.6152422706632)

In [6]:
latitudes = []
longitudes = []
distances_from_center = []
xs = []
ys = []


for i in range(0, int(41/k)):
    y = y_min + i * y_step
    x_offset = 300 #if i%2==0 else 0
    
    for j in range(0, 41):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(hcmc_center_x, hcmc_center_y, x, y)
        if (distance_from_center <= 12001):
            lon, lat = xy_to_lonlat(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

print(len(latitudes), 'candidate area centers generated.')

1452 candidate area centers generated.


Let's visualize the data we have so far: the central location and the candidate area centers:

In [7]:
#!pip install folium

import folium

In [8]:
map_hcmc = folium.Map(location = hcmc_center, zoom_start=13)
folium.Marker(hcmc_center, popup='Central location').add_to(map_hcmc)


for lat, lon in zip(latitudes, longitudes):
    #folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_hcmc) 
    folium.Circle([lat, lon], radius=60, color='red', fill=False).add_to(map_hcmc)
    #folium.Marker([lat, lon]).add_to(map_hcmc)
map_hcmc



OK, we now have the coordinates of the centers of the areas to be evaluated, equally spaced (the distance from every point to its neighbors is exactly the same) and within ~12 km of the central location.

Let's now use Google Maps API to get approximate addresses of those locations.

In [9]:

def get_address(api_key, latitude, longitude, verbose=False):
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&latlng={},{}'.format(api_key, latitude, longitude)
        response = requests.get(url).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        address = results[0]['formatted_address']
        return address
    except:
        return None

addr = get_address(google_api_key, hcmc_center[0], hcmc_center[1])
print('Reverse geocoding check')
print('-----------------------')
print('Address of [{}, {}] is: {}'.format(hcmc_center[0], hcmc_center[1], addr))

Reverse geocoding check
-----------------------
Address of [10.79751205, 106.67989825] is: 3 Trần Khắc Trân, Phường 15, Phú Nhuận, Hồ Chí Minh, Vietnam
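
`get_address` wraps the whole request in a bare `except`, so a failed request, an empty result list, and an unexpected payload all look the same to the caller. Here is a minimal sketch of parsing the response explicitly, assuming the usual Geocoding JSON shape with `status`, `results`, and `formatted_address` fields. `parse_formatted_address` and both sample payloads are hypothetical.

```python
def parse_formatted_address(response):
    """Return the first formatted address from a Geocoding-style JSON dict, or None."""
    if response.get('status') != 'OK':
        return None
    results = response.get('results') or []
    if not results:
        return None
    return results[0].get('formatted_address')

# Hypothetical, trimmed responses for illustration only
ok = {'status': 'OK',
      'results': [{'formatted_address': '3 Trần Khắc Trân, Phường 15, Phú Nhuận, Hồ Chí Minh, Vietnam'}]}
empty = {'status': 'ZERO_RESULTS', 'results': []}

print(parse_formatted_address(ok))
print(parse_formatted_address(empty))
```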


In [10]:
print('Obtaining location addresses: ', end='')
addresses = []
for lat, lon in zip(latitudes, longitudes):
    address = get_address(google_api_key, lat, lon)
    if address is None:
        address = 'NO ADDRESS'
    address = address.replace(', Vietnam', '') # We don't need country part of address
    addresses.append(address)
    print(' .', end='')
print(' done.')

Obtaining location addresses:  . . . . . . . . . . . . . . . . . . . . ... done.

In [11]:
len(addresses)

1452

In [12]:
addresses

['30 Phan Thúc Duyện, Phường 4, Tân Bình, Hồ Chí Minh',
 '6/22 Hẻm số 6 Đồ Sơn, Phường 4, Tân Bình, Hồ Chí Minh',
 '27 Sầm Sơn, Phường 4, Tân Bình, Hồ Chí Minh',
 '36/14 Hẻm 36 Giải Phóng, Phường 4, Tân Bình, Hồ Chí Minh',
 '2/15 Hẻm số 2 Đồng Khởi, Phường 4, Tân Bình, Hồ Chí Minh',
 '15 Cộng Hòa, Phường 4, Tân Bình, Hồ Chí Minh',
 '51 Nguyễn Thái Bình, Phường 4, Tân Bình, Hồ Chí Minh',
 '308/5 Hoàng Văn Thụ, Phường 4, Tân Bình, Hồ Chí Minh',
 '40a, 40B Út Tịch, Phường 4, Tân Bình, Hồ Chí Minh',
 '437/13- 437/15 Hoàng Văn Thụ, Phường 4, Tân Bình, Hồ Chí Minh',
 '27 Nguyễn Đình Khơi, Phường 4, Tân Bình, Hồ Chí Minh',
 '61 Nguyễn Đình Khơi, Phường 4, Tân Bình, Hồ Chí Minh',
 '130 Thăng Long, Phường 4, Tân Bình, Hồ Chí Minh',
 'Nhà ga hàng hóa quốc tế Sân bay Tân Sơn Nhất, Nguyễn Văn Vĩnh, Phường 4, Tân Bình, Hồ Chí Minh',
 '6bis Thăng Long, Phường 4, Tân Bình, Hồ Chí Minh',
 '28, Đường Phan Thúc Duyện, Phường 4, Tân Bình, Hồ Chí Minh',
 '3 Hẻm số 3 Đồ Sơn, Phường 4, Tân Bình, Hồ Chí Minh

Looking good. Let's now place all this into a Pandas dataframe.

In [13]:
import pandas as pd

df_locations = pd.DataFrame({'Address': addresses,
                             'Latitude': latitudes,
                             'Longitude': longitudes,
                             'X': xs,
                             'Y': ys,
                             'Distance from center': distances_from_center})

df_locations.head(10)

| | Address | Latitude | Longitude | X | Y | Distance from center |
|---|---|---|---|---|---|---|
| 0 | 30 Phan Thúc Duyện, Phường 4, Tân Bình, Hồ Chí... | 10.806386 | 106.659513 | 15217490.0 | 10888310.0 | 11945.259904 |
| 1 | 6/22 Hẻm số 6 Đồ Sơn, Phường 4, Tân Bình, Hồ C... | 10.805288 | 106.659364 | 15218090.0 | 10888310.0 | 11793.609888 |
| 2 | 27 Sầm Sơn, Phường 4, Tân Bình, Hồ Chí Minh | 10.80419 | 106.659216 | 15218690.0 | 10888310.0 | 11670.871184 |
| 3 | 36/14 Hẻm 36 Giải Phóng, Phường 4, Tân Bình, H... | 10.803092 | 106.659068 | 15219290.0 | 10888310.0 | 11577.9633 |
| 4 | 2/15 Hẻm số 2 Đồng Khởi, Phường 4, Tân Bình, H... | 10.801995 | 106.658919 | 15219890.0 | 10888310.0 | 11515.608286 |
| 5 | 15 Cộng Hòa, Phường 4, Tân Bình, Hồ Chí Minh | 10.800897 | 106.658771 | 15220490.0 | 10888310.0 | 11484.303818 |
| 6 | 51 Nguyễn Thái Bình, Phường 4, Tân Bình, Hồ Ch... | 10.7998 | 106.658623 | 15221090.0 | 10888310.0 | 11484.303818 |
| 7 | 308/5 Hoàng Văn Thụ, Phường 4, Tân Bình, Hồ Ch... | 10.798702 | 106.658475 | 15221690.0 | 10888310.0 | 11515.608286 |
| 8 | 40a, 40B Út Tịch, Phường 4, Tân Bình, Hồ Chí Minh | 10.797605 | 106.658326 | 15222290.0 | 10888310.0 | 11577.9633 |
| 9 | 437/13- 437/15 Hoàng Văn Thụ, Phường 4, Tân Bì... | 10.796508 | 106.658178 | 15222890.0 | 10888310.0 | 11670.871184 |


In [14]:
df_locations.sort_values(by=['Distance from center'])

| | Address | Latitude | Longitude | X | Y | Distance from center |
|---|---|---|---|---|---|---|
| 717 | 48D Trần Khắc Trân, Phường 15, Phú Nhuận, Hồ C... | 10.798086 | 106.679919 | 1.522049e+07 | 1.089974e+07 | 303.951092 |
| 718 | 18 Nguyễn Đình Chính, Phường 15, Phú Nhuận, Hồ... | 10.796989 | 106.679769 | 1.522109e+07 | 1.089974e+07 | 303.951092 |
| 758 | 11/12 Nguyễn Trọng Tuyển, Phường 15, Phú Nhuận... | 10.796861 | 106.680730 | 1.522109e+07 | 1.090026e+07 | 558.229748 |
| 757 | 318 Phan Đình Phùng, Phường 1, Phú Nhuận, Hồ C... | 10.797958 | 106.680880 | 1.522049e+07 | 1.090026e+07 | 558.229748 |
| 678 | 91 Nguyễn Trọng Tuyển, Phường 15, Phú Nhuận, H... | 10.797118 | 106.678808 | 1.522109e+07 | 1.089923e+07 | 642.769073 |
| 677 | 33D Trần Khắc Trân, Phường 15, Phú Nhuận, Hồ C... | 10.798215 | 106.678958 | 1.522049e+07 | 1.089923e+07 | 642.769073 |
| 716 | 2 Hoàng Văn Thụ, Phường 9, Phú Nhuận, Hồ Chí Minh | 10.799184 | 106.680069 | 1.521989e+07 | 1.089974e+07 | 901.324729 |
| 719 | 130/3 Nguyễn Đình Chính, Phường 15, Phú Nhuận,... | 10.795892 | 106.679619 | 1.522169e+07 | 1.089974e+07 | 901.324729 |
| 756 | 215D/12 Đoàn Thị Điểm, Phường 1, Phú Nhuận, Hồ... | 10.799055 | 106.681030 | 1.521989e+07 | 1.090026e+07 | 1015.687182 |
| 759 | 42/3 Duy Tân, Phường 15, Phú Nhuận, Hồ Chí Minh | 10.795763 | 106.680580 | 1.522169e+07 | 1.090026e+07 | 1015.687182 |


...and let's now save/persist this data into a local file.

In [15]:
df_locations.to_pickle('./locations.pkl')    

In [16]:
df_locations.shape

(1452, 6)

# Foursquare
  
Now that we have our location candidates, let's use the Foursquare API to get info on the venues in each area.

We're interested in venues in the 'food' category, which includes coffee shops and cafes. A venue is kept if its category name marks it as a restaurant or if its category ID is one of the coffee shop categories listed below; only the latter are flagged as coffee shops.
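
As a minimal sketch of the ID-based part of that filter, the check below flags a venue as a coffee shop when any of its categories matches a known ID. It assumes categories arrive as `(name, id)` tuples, as produced by `get_categories` in the next cell. `is_coffee_venue` is a hypothetical helper, the two IDs are a subset of the notebook's list, and the category names are placeholders.

```python
# A subset of the category IDs used in the notebook's coffee_shop_categories list
COFFEE_CATEGORY_IDS = {
    '4bf58dd8d48988d16d941735',
    '4bf58dd8d48988d1e0931735',
}

def is_coffee_venue(categories, coffee_ids=COFFEE_CATEGORY_IDS):
    """categories: list of (name, id) tuples; True if any ID is a coffee category."""
    return any(cat_id in coffee_ids for _name, cat_id in categories)

# Hypothetical venues; the names and the second ID are placeholders
coffee = is_coffee_venue([('example coffee category', '4bf58dd8d48988d1e0931735')])
other = is_coffee_venue([('example restaurant category', 'not-a-coffee-id')])
print(coffee, other)
```

Using a set makes each membership test constant-time and silently absorbs duplicate IDs.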

In [17]:
# Category IDs corresponding to coffee shop were taken from Foursquare web site 
#(https://developer.foursquare.com/docs/resources/categories):

food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues

coffee_shop_categories = ['4bf58dd8d48988d16d941735',
                          '4bf58dd8d48988d1e0931735',
                          '4bf58dd8d48988d1f0941735',
                          '5665c7b9498e7d8a4f2c0f06',
                          '54f4ba06498e2cf5561da814',
                          '4bf58dd8d48988d18d941735']

def is_restaurant(categories, specific_filter=None):
    restaurant_words = ['restaurant', 'diner', 'taverna', 'steakhouse']
    restaurant = False
    specific = False
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        for r in restaurant_words:
            if r in category_name:
                restaurant = True
        if 'fast food' in category_name:
            restaurant = False
        if not(specific_filter is None) and (category_id in specific_filter):
            specific = True
            restaurant = True
    return restaurant, specific

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    #address = address.replace(', Deutschland', '')
    address = address.replace(', Vietnam', '')
    return address

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=100):
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]        
    except:
        venues = []
    return venues

In [18]:
#pthson@gmail.com
foursquare_client_id = 'L41NSWBFCRZHXFAIEDP2Y0ZIK22PSCH3HHI0UYNVWNK31MR3' # your Foursquare ID
foursquare_client_secret = 'I2WEPXFJG113SFJYPGTFGMKIYTBUJBMO2L4QTKSM50GTZ4X2' # your Foursquare Secret

In [19]:
import pickle

In [20]:
def get_restaurants(lats, lons):
    restaurants = {}
    coffee_shops = {}
    location_restaurants = []

    print('Obtaining venues around candidate locations:', end='')
    for lat, lon in zip(lats, lons):
        # Using radius=700 (larger than the 600 m grid spacing) to make sure the searches overlap and give full coverage, so we don't miss any venue (we're using dictionaries to remove any duplicates resulting from area overlaps)
        venues = get_venues_near_location(
            lat, lon, food_category, foursquare_client_id, foursquare_client_secret, radius=700, limit=100)
        area_restaurants = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            is_res, is_coffee = is_restaurant(
                venue_categories, specific_filter=coffee_shop_categories)
            if is_res:
                x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
                restaurant = (
                    venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_coffee, x, y)
                if venue_distance <= 300:
                    area_restaurants.append(restaurant)
                restaurants[venue_id] = restaurant
                if is_coffee:
                    coffee_shops[venue_id] = restaurant
        location_restaurants.append(area_restaurants)
        print(' .', end='')
    print(' done.')
    return restaurants, coffee_shops, location_restaurants
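
Because the 700 m search radius is larger than the 600 m grid spacing, neighboring searches return many of the same venues. Here is a minimal sketch of the deduplication idea used above, keying a dictionary by venue ID. `merge_venues` and the sample records are hypothetical.

```python
def merge_venues(area_results):
    """Merge per-area venue lists into one dict keyed by venue ID.

    A venue seen in several overlapping areas is stored once;
    a later occurrence simply overwrites the earlier one.
    """
    merged = {}
    for venues in area_results:
        for venue in venues:
            merged[venue['id']] = venue
    return merged

# Hypothetical results for two overlapping search areas
area_a = [{'id': 'v1', 'name': 'Cafe A'}, {'id': 'v2', 'name': 'Cafe B'}]
area_b = [{'id': 'v2', 'name': 'Cafe B'}, {'id': 'v3', 'name': 'Cafe C'}]

unique = merge_venues([area_a, area_b])
print(sorted(unique))
```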

In [21]:
# Try to load from local file system in case we did this before
restaurants = {}
coffee_shops = {}
location_restaurants = []
loaded = False

try:
    with open('restaurants.pkl', 'rb') as f:
        restaurants = pickle.load(f)
    with open('coffee_shops.pkl', 'rb') as f:
        coffee_shops = pickle.load(f)
    with open('location_restaurants.pkl', 'rb') as f:
        location_restaurants = pickle.load(f)
    print('Drinking shop data loaded.')
    loaded = True
except:
    pass

# If load failed use the Foursquare API to get the data
if not loaded:
    restaurants, coffee_shops, location_restaurants = get_restaurants(
        latitudes, longitudes)

    # Let's persist this in the local file system
    with open('restaurants.pkl', 'wb') as f:
        pickle.dump(restaurants, f)
    with open('coffee_shops.pkl', 'wb') as f:
        pickle.dump(coffee_shops, f)
    with open('location_restaurants.pkl', 'wb') as f:
        pickle.dump(location_restaurants, f)

Drinking shop data loaded.
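
The load-or-fetch pattern in the cell above can be factored into a small reusable helper. The sketch below is one possible shape, using only the standard library and catching specific exceptions instead of a bare `except`. `load_or_compute` and `fake_fetch` are hypothetical names, and a temporary directory stands in for the notebook's working folder.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled object at `path`; otherwise call `compute()`, save, and return it."""
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except (FileNotFoundError, EOFError, pickle.UnpicklingError):
        value = compute()
        with open(path, 'wb') as f:
            pickle.dump(value, f)
        return value

calls = []

def fake_fetch():
    """Stand-in for an expensive API download; records each call."""
    calls.append(1)
    return {'v1': 'Cafe A'}

cache_path = os.path.join(tempfile.mkdtemp(), 'venues.pkl')
first = load_or_compute(cache_path, fake_fetch)    # computes and writes the file
second = load_or_compute(cache_path, fake_fetch)   # reads the file, no second fetch
print(first == second, len(calls))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.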


In [22]:
import numpy as np

print('Total number of Drinking shop:', len(restaurants))
print('Total number of Coffee shop:', len(coffee_shops))
print('Percentage of Coffee shop: {:.2f}%'.format(len(coffee_shops) / len(restaurants) * 100))
print('Average number of Drinking shops in neighborhood:', np.array([len(r) for r in location_restaurants]).mean())

Total number of Drinking shop: 488
Total number of Coffee shop: 164
Percentage of Coffee shop: 33.61%
Average number of Drinking shops in neighborhood: 2.34297520661157


In [23]:
print('List of all Drinking shops')
print('-----------------------')
for r in list(restaurants.values())[:10]:
    print(r)
print('...')
print('Total:', len(restaurants))

List of all Drinking shops
-----------------------
('4ee7628d0aaf8939528f03c0', 'Hải sản Rạn Biển', 10.804503776759306, 106.66135831868787, 'Phan Thúc Duyện (Thăng Long), Hồ Chí Minh, Việt Nam', 695, False, 15218357.587846743, 10889446.096280014)
('544b2fed498e45ba2fce60c1', '한솔 - HanSol Korean Restaurant', 10.80768675612086, 106.66169972835752, 'Thang Long (Tan Binh), Thành phố Hồ Chí Minh, Thành phố Hồ Chí Minh, Việt Nam', 698, False, 15216624.595223978, 10889398.698742615)
('5299d49c11d2ce1a3b834225', 'GoGi House Superbowl', 10.808581665693495, 106.6644974467835, 'A43 Truong Son St., Tan Binh Dist., Thành phố Hồ Chí Minh, Thành phố Hồ Chí Minh, Việt Nam', 686, False, 15215945.48641174, 10890818.989033066)
('4c0f91afd64c0f4792ec295d', 'Thai House', 10.807818659834261, 106.66353061454383, '21B Hau Giang St., Tan Binh Dist., Thành phố Hồ Chí Minh, Thành phố Hồ Chí Minh, Việt Nam', 682, False, 15216423.741718374, 10890360.815515963)
('5486ed4b498e985cf50d8d6a', 'Phở Hùng', 10.8080185551

In [24]:
print('List of Coffee shops')
print('---------------------------')
for r in list(coffee_shops.values())[:10]:
    print(r)
print('...')
print('Total:', len(coffee_shops))

List of Coffee shops
---------------------------
('594f4106db1d81557e741b88', 'The Coffee House Thăng Long', 10.802981, 106.66062, '51 Thăng Long (Quận Tân Bình), Thành phố Hồ Chí Minh, Thành phố Hồ Chí Minh, Việt Nam', 692, True, 15219227.65552291, 10889163.524710394)
('5110a2d7e4b0c512bbbb590e', 'Highlands Coffee', 10.800969054162756, 106.65785871609506, '15-18 Cong Hoa St., Tan Binh Dist., Thành phố Hồ Chí Minh, Thành phố Hồ Chí Minh, Việt Nam', 675, True, 15220504.054521177, 10887841.820892932)
('4e85650f775b9667badb867b', 'Monaco', 10.804765, 106.665007, 'Thành phố Hồ Chí Minh, Việt Nam', 675, True, 15217957.961512249, 10891364.010292241)
('501f598ce4b0373b151db3c7', 'Café BALI', 10.808274416526773, 106.66521745470263, '1B Tiền Giang, Phường 2, Quận Tân Bình, Việt Nam', 678, True, 15216059.184258165, 10891223.139634041)
('4df2d11352b100c2d7f8b533', 'Tuấn Ngọc Cafe', 10.798723332483378, 106.65841437512873, '308/13 Hoang Van Thu, Ward 4, Tan Binh (Ut Tich), Thành phố Hồ Chí Minh, Th

In [25]:
pd.DataFrame(coffee_shops).T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 594f4106db1d81557e741b88 | 594f4106db1d81557e741b88 | The Coffee House Thăng Long | 10.803 | 106.661 | 51 Thăng Long (Quận Tân Bình), Thành phố Hồ Ch... | 692 | True | 1.52192e+07 | 1.08892e+07 |
| 5110a2d7e4b0c512bbbb590e | 5110a2d7e4b0c512bbbb590e | Highlands Coffee | 10.801 | 106.658 | 15-18 Cong Hoa St., Tan Binh Dist., Thành phố ... | 675 | True | 1.52205e+07 | 1.08878e+07 |
| 4e85650f775b9667badb867b | 4e85650f775b9667badb867b | Monaco | 10.8048 | 106.665 | Thành phố Hồ Chí Minh, Việt Nam | 675 | True | 1.5218e+07 | 1.08914e+07 |
| 501f598ce4b0373b151db3c7 | 501f598ce4b0373b151db3c7 | Café BALI | 10.8083 | 106.665 | 1B Tiền Giang, Phường 2, Quận Tân Bình, Việt Nam | 678 | True | 1.52161e+07 | 1.08912e+07 |
| 4df2d11352b100c2d7f8b533 | 4df2d11352b100c2d7f8b533 | Tuấn Ngọc Cafe | 10.7987 | 106.658 | 308/13 Hoang Van Thu, Ward 4, Tan Binh (Ut Tic... | 689 | True | 1.52217e+07 | 1.08883e+07 |
| 573eab93498ed93baa3687b0 | 573eab93498ed93baa3687b0 | Highland Coffee @Pico Plaza | 10.8014 | 106.653 | Cộng Hoà, HCM, Việt Nam | 669 | True | 1.52206e+07 | 1.08852e+07 |
| 4d91d7589acaa143cc55f0f0 | 4d91d7589acaa143cc55f0f0 | Cafe Cõi Riêng | 10.799 | 106.664 | 334A, Nguyễn Trọng Tuyển, Hochiminh, Thành phố... | 694 | True | 1.52211e+07 | 1.08914e+07 |
| 4daab1088154abafc2b28ecd | 4daab1088154abafc2b28ecd | LUMOS Café | 10.7986 | 106.664 | 383 Nguyễn Trọng Tuyển St., Phu Nhuan Dist., T... | 682 | True | 1.52213e+07 | 1.08912e+07 |
| 59008fcada5ede037e0d816d | 59008fcada5ede037e0d816d | GongCha Nguyen Thai Binh | 10.8027 | 106.657 | 59 (Nguyễn Thái Bình), Tan Binh, Thành phố Hồ ... | 662 | True | 1.52197e+07 | 1.08873e+07 |
| 51b99514498e541eebf74af8 | 51b99514498e541eebf74af8 | Little Story Coffee | 10.8005 | 106.653 | 01 Lê Trung Nghĩa, P12, Tân Bình, Việt Nam | 683 | True | 1.52211e+07 | 1.0885e+07 |


In [26]:
print('Drinking shops around location')
print('---------------------------')
for i in range(100, 110):
    rs = location_restaurants[i][:8]
    names = ', '.join([r[1] for r in rs])
    print('Drinking shops around location {}: {}'.format(i+1, names))

Drinking shops around location
---------------------------
Drinking shops around location 101: GoGi House Superbowl, Phở Hùng, Thai House, Chef Mamma, Bánh Cuốn Xuân Hường, iSushi Superbowl, Tokyo Deli Truong Son, Món Huế Restaurant
Drinking shops around location 102: Phở Hùng, Ốc Út Liên, Thai House, Chef Mamma, Bánh Cuốn Xuân Hường, iSushi Superbowl, Tokyo Deli Truong Son, Thien Huong Sushi
Drinking shops around location 103: Ốc Út Liên, Bánh Cuốn Xuân Hường, Tokyo Deli Truong Son, Monaco, Pho 24, Hai san Ut Lien, Biển Đông Restaurant
Drinking shops around location 104: Ốc Út Liên, Monaco, Biển Đông Restaurant, Pho 24, Hai san Ut Lien
Drinking shops around location 105: Bun Cha Xuan Tu, Bún Chả Xuân Tú, Biển Đông Restaurant, Monaco
Drinking shops around location 106: Bun Cha Xuan Tu, Cafe Cõi Riêng, Bún Chả Xuân Tú, Vườn Phố Restaurant, Secret Garden Tea and Pastries
Drinking shops around location 107: Bun Cha Xuan Tu, Cafe Cõi Riêng, LUMOS Café, Bún Chả Xuân Tú, Ốc Châu, Nhà hàng Qu

In [27]:
map_hcmc = folium.Map(location=hcmc_center, zoom_start=13)
folium.Marker(hcmc_center, popup='Central location').add_to(map_hcmc)
for res in restaurants.values():
    lat = res[2]; lon = res[3]
    is_coffee = res[6]
    color = 'red' if is_coffee else 'blue'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_hcmc)
map_hcmc

Looking good. So now we have all the coffee shops in the area within a few kilometers of the central location.

# Methodology <a name="methodology"></a>

In this project, we will detect areas of the city that have a low density of coffee shops, particularly those with a low number of coffee shops. We will limit our analysis to an area of about 12 km around the central location, the midpoint between Tan Son Nhat International Airport and the city center (coordinates taken at the Ho Chi Minh City People's Committee building in District 1). The method consists of three stages:

- Stage #1: Collect the required data: the location and type of every venue within 12 km of the central location defined above. Coffee shops are identified according to Foursquare's venue categorization, and we record the coordinates of every cafe found in the area.

- Stage #2: Analyze the data collected in stage #1. The analysis starts with data exploration, such as the density of coffee shops across different areas. We use heatmaps to identify a few promising areas close to the center with a low number of coffee shops in general, and focus our attention on those areas.

- Stage #3: Focus on the most promising areas and, within those, create clusters of locations that meet some basic requirements established in discussion with stakeholders: (1) consider locations with no more than two coffee shops within a radius of 400 meters, and prefer locations with no coffee shops within a radius of 600 meters; (2) present a map of all such locations; (3) create clusters of those locations to identify general areas that should be the starting point for final 'street level' exploration and the search for an optimal venue location by stakeholders.
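
Condition (1) of stage #3 can be expressed as a simple filter over candidate points. The sketch below is one possible reading of it, not the notebook's implementation: a location passes if at most two shops lie within 400 m, and a stricter mode additionally requires that no shop lies within 600 m. `count_within`, `is_good_location`, and all coordinates are hypothetical, and planar coordinates in meters are assumed.

```python
import math

def count_within(point, shops, radius):
    """Number of shops whose planar distance to `point` is at most `radius`."""
    px, py = point
    return sum(1 for sx, sy in shops if math.hypot(sx - px, sy - py) <= radius)

def is_good_location(point, shops, max_nearby=2, nearby_radius=400.0,
                     clear_radius=600.0, require_clear=False):
    """At most `max_nearby` shops within `nearby_radius`;
    if `require_clear` is set, also no shops at all within `clear_radius`."""
    if count_within(point, shops, nearby_radius) > max_nearby:
        return False
    if require_clear and count_within(point, shops, clear_radius) > 0:
        return False
    return True

# Hypothetical planar coordinates in meters
shops = [(0.0, 0.0), (120.0, 0.0), (0.0, 150.0), (2000.0, 2000.0)]
candidates = [(50.0, 50.0), (500.0, 0.0), (1000.0, 1000.0)]

relaxed = [p for p in candidates if is_good_location(p, shops)]
strict = [p for p in candidates if is_good_location(p, shops, require_clear=True)]
print(relaxed)
print(strict)
```

Keeping the thresholds as parameters makes it easy to revisit them with stakeholders without touching the filtering logic.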


# Analysis <a name="analysis"></a>

Let's perform some basic exploratory data analysis and derive some additional info from our raw data. First let's count the **number of venues (drinking shops) in every candidate area**:

In [28]:
location_restaurants_count = [len(res) for res in location_restaurants]

df_locations['Drinking shops in area'] = location_restaurants_count

print('Average number of restaurants in every area with radius = 300m:', np.array(location_restaurants_count).mean())

df_locations.head(10)

Average number of restaurants in every area with radius = 300m: 2.34297520661157


| | Address | Latitude | Longitude | X | Y | Distance from center | Drinking shops in area |
|---|---|---|---|---|---|---|---|
| 0 | 30 Phan Thúc Duyện, Phường 4, Tân Bình, Hồ Chí... | 10.806386 | 106.659513 | 15217490.0 | 10888310.0 | 11945.259904 | 3 |
| 1 | 6/22 Hẻm số 6 Đồ Sơn, Phường 4, Tân Bình, Hồ C... | 10.805288 | 106.659364 | 15218090.0 | 10888310.0 | 11793.609888 | 2 |
| 2 | 27 Sầm Sơn, Phường 4, Tân Bình, Hồ Chí Minh | 10.80419 | 106.659216 | 15218690.0 | 10888310.0 | 11670.871184 | 2 |
| 3 | 36/14 Hẻm 36 Giải Phóng, Phường 4, Tân Bình, H... | 10.803092 | 106.659068 | 15219290.0 | 10888310.0 | 11577.9633 | 3 |
| 4 | 2/15 Hẻm số 2 Đồng Khởi, Phường 4, Tân Bình, H... | 10.801995 | 106.658919 | 15219890.0 | 10888310.0 | 11515.608286 | 3 |
| 5 | 15 Cộng Hòa, Phường 4, Tân Bình, Hồ Chí Minh | 10.800897 | 106.658771 | 15220490.0 | 10888310.0 | 11484.303818 | 4 |
| 6 | 51 Nguyễn Thái Bình, Phường 4, Tân Bình, Hồ Ch... | 10.7998 | 106.658623 | 15221090.0 | 10888310.0 | 11484.303818 | 3 |
| 7 | 308/5 Hoàng Văn Thụ, Phường 4, Tân Bình, Hồ Ch... | 10.798702 | 106.658475 | 15221690.0 | 10888310.0 | 11515.608286 | 2 |
| 8 | 40a, 40B Út Tịch, Phường 4, Tân Bình, Hồ Chí Minh | 10.797605 | 106.658326 | 15222290.0 | 10888310.0 | 11577.9633 | 1 |
| 9 | 437/13- 437/15 Hoàng Văn Thụ, Phường 4, Tân Bì... | 10.796508 | 106.658178 | 15222890.0 | 10888310.0 | 11670.871184 | 3 |


In [29]:
df_locations.shape

(1452, 7)

OK, now let's calculate the **distance to the nearest coffee shop from every candidate area center** (not only those within 300 m; we want the distance to the closest one, regardless of how distant it is).

In [30]:
distances_to_coffee_shop = []

for area_x, area_y in zip(xs, ys):
    min_distance = 10000
    for res in coffee_shops.values():
        res_x = res[7]
        res_y = res[8]
        d = calc_xy_distance(area_x, area_y, res_x, res_y)
        if d<min_distance:
            min_distance = d
    distances_to_coffee_shop.append(min_distance)

df_locations['Distance to Coffee shop'] = distances_to_coffee_shop
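
The loop above starts from `min_distance = 10000`, so any area whose nearest coffee shop is farther than 10 km would be recorded as exactly 10000. A minimal alternative sketch uses `min` with an infinite default and therefore has no such cap. `nearest_distance` and the sample coordinates are hypothetical, and planar coordinates in meters are assumed.

```python
import math

def nearest_distance(point, others):
    """Distance from `point` to the closest of `others`; math.inf if `others` is empty."""
    px, py = point
    return min((math.hypot(ox - px, oy - py) for ox, oy in others), default=math.inf)

# Hypothetical planar coordinates in meters
coffee_xy = [(0.0, 0.0), (3.0, 4.0), (20000.0, 0.0)]

near = nearest_distance((3.0, 0.0), coffee_xy)
far = nearest_distance((50000.0, 0.0), coffee_xy)
none = nearest_distance((0.0, 0.0), [])
print(near, far, none)
```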

In [31]:
df_locations.head(10)

| | Address | Latitude | Longitude | X | Y | Distance from center | Drinking shops in area | Distance to Coffee shop |
|---|---|---|---|---|---|---|---|---|
| 0 | 30 Phan Thúc Duyện, Phường 4, Tân Bình, Hồ Chí... | 10.806386 | 106.659513 | 15217490.0 | 10888310.0 | 11945.259904 | 3 | 1936.913461 |
| 1 | 6/22 Hẻm số 6 Đồ Sơn, Phường 4, Tân Bình, Hồ C... | 10.805288 | 106.659364 | 15218090.0 | 10888310.0 | 11793.609888 | 2 | 1422.432468 |
| 2 | 27 Sầm Sơn, Phường 4, Tân Bình, Hồ Chí Minh | 10.80419 | 106.659216 | 15218690.0 | 10888310.0 | 11670.871184 | 2 | 1007.469353 |
| 3 | 36/14 Hẻm 36 Giải Phóng, Phường 4, Tân Bình, H... | 10.803092 | 106.659068 | 15219290.0 | 10888310.0 | 11577.9633 | 3 | 852.452268 |
| 4 | 2/15 Hẻm số 2 Đồng Khởi, Phường 4, Tân Bình, H... | 10.801995 | 106.658919 | 15219890.0 | 10888310.0 | 11515.608286 | 3 | 776.172745 |
| 5 | 15 Cộng Hòa, Phường 4, Tân Bình, Hồ Chí Minh | 10.800897 | 106.658771 | 15220490.0 | 10888310.0 | 11484.303818 | 4 | 471.6415 |
| 6 | 51 Nguyễn Thái Bình, Phường 4, Tân Bình, Hồ Ch... | 10.7998 | 106.658623 | 15221090.0 | 10888310.0 | 11484.303818 | 3 | 583.693624 |
| 7 | 308/5 Hoàng Văn Thụ, Phường 4, Tân Bình, Hồ Ch... | 10.798702 | 106.658475 | 15221690.0 | 10888310.0 | 11515.608286 | 2 | 22.405614 |
| 8 | 40a, 40B Út Tịch, Phường 4, Tân Bình, Hồ Chí Minh | 10.797605 | 106.658326 | 15222290.0 | 10888310.0 | 11577.9633 | 1 | 231.524389 |
| 9 | 437/13- 437/15 Hoàng Văn Thụ, Phường 4, Tân Bì... | 10.796508 | 106.658178 | 15222890.0 | 10888310.0 | 11670.871184 | 3 | 433.3863 |


In [32]:
print('Average distance to closest Coffee shop from each area center:',
      df_locations['Distance to Coffee shop'].mean())

Average distance to closest Coffee shop from each area center: 1325.8583358295975


OK, so **on average a Coffee shop can be found within ~1.3km** of every candidate area center. The first rows above show that some candidates have one much closer than that, so we need to filter our areas carefully!

Let's create a map showing a **heatmap / density of drinking shops** and try to extract some meaningful info from it. Also, let's show the **borders of the central location** on our map and a few circles indicating distances of 1km, 2km and 3km from the central location.

In [33]:
restaurant_latlons = [[res[2], res[3]] for res in restaurants.values()]

coffee_latlons = [[res[2], res[3]] for res in coffee_shops.values()]

In [34]:
from folium import plugins
from folium.plugins import HeatMap

map_hcmc = folium.Map(location=hcmc_center, zoom_start=13)
folium.TileLayer('cartodbpositron').add_to(map_hcmc) #cartodbpositron cartodbdark_matter
HeatMap(restaurant_latlons).add_to(map_hcmc)
folium.Marker(hcmc_center).add_to(map_hcmc)
folium.Circle(hcmc_center, radius=1000, fill=False, color='white').add_to(map_hcmc)
folium.Circle(hcmc_center, radius=2000, fill=False, color='white').add_to(map_hcmc)
folium.Circle(hcmc_center, radius=3000, fill=False, color='white').add_to(map_hcmc)
map_hcmc

It looks like a few pockets of low drinking-shop density close to the central location can be found **south, south-east and east of it**.

Let's create another map showing the **heatmap/density of Coffee shops** only.

In [35]:
map_hcmc = folium.Map(location=hcmc_center, zoom_start=13)
folium.TileLayer('cartodbpositron').add_to(map_hcmc) #cartodbpositron cartodbdark_matter
HeatMap(coffee_latlons).add_to(map_hcmc)
folium.Marker(hcmc_center).add_to(map_hcmc)
folium.Circle(hcmc_center, radius=1000, fill=False, color='white').add_to(map_hcmc)
folium.Circle(hcmc_center, radius=2000, fill=False, color='white').add_to(map_hcmc)
folium.Circle(hcmc_center, radius=3000, fill=False, color='white').add_to(map_hcmc)
map_hcmc

In [42]:
roi_x_min = hcmc_center_x - 2000
roi_y_max = hcmc_center_y + 1000
roi_width = 5000
roi_height = 5000
roi_center_x = roi_x_min + 2500
roi_center_y = roi_y_max - 2500
roi_center_lon, roi_center_lat = xy_to_lonlat(roi_center_x, roi_center_y)
roi_center = [roi_center_lat, roi_center_lon]

map_hcmc = folium.Map(location=roi_center, zoom_start=14)
HeatMap(restaurant_latlons).add_to(map_hcmc)
folium.Marker(hcmc_center).add_to(map_hcmc)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(map_hcmc)
map_hcmc

In [43]:
k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_step = 100
y_step = 100 * k 
roi_y_min = roi_center_y - 2500

roi_latitudes = []
roi_longitudes = []
roi_xs = []
roi_ys = []
for i in range(0, int(51/k)):
    y = roi_y_min + i * y_step
    x_offset = 50 if i%2==0 else 0
    for j in range(0, 51):
        x = roi_x_min + j * x_step + x_offset
        d = calc_xy_distance(roi_center_x, roi_center_y, x, y)
        if (d <= 2501):
            lon, lat = xy_to_lonlat(x, y)
            roi_latitudes.append(lat)
            roi_longitudes.append(lon)
            roi_xs.append(x)
            roi_ys.append(y)

print(len(roi_latitudes), 'candidate neighborhood centers generated.')

2261 candidate neighborhood centers generated.
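
The factor `k = sqrt(3)/2` in the cell above is what makes the grid hexagonal: when every other row is shifted by half a step and rows are spaced `step * k` apart, each center is the same distance from its neighbors in the adjacent rows as from its neighbors in its own row. A small sketch that checks this on a tiny grid is shown below. The `hex_grid` function and its parameters are illustrative names, not part of the notebook.

```python
import math

def hex_grid(x_min, y_min, step, n_cols, n_rows):
    """Generate (x, y) centers of a hexagonal (staggered) grid."""
    k = math.sqrt(3) / 2  # row spacing factor for a hexagonal layout
    points = []
    for i in range(n_rows):
        y = y_min + i * step * k
        # Shift every other row by half a step, as in the notebook cell
        x_offset = step / 2 if i % 2 == 0 else 0.0
        for j in range(n_cols):
            points.append((x_min + j * step + x_offset, y))
    return points

grid = hex_grid(0.0, 0.0, step=100.0, n_cols=3, n_rows=2)

# Distance to a neighbor in the same row and to one in the next row
same_row = math.dist(grid[0], grid[1])
next_row = math.dist(grid[0], grid[3])
print(round(same_row, 6), round(next_row, 6))
```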


In [44]:
def count_restaurants_nearby(x, y, restaurants, radius=250):    
    count = 0
    for res in restaurants.values():
        res_x = res[7]; res_y = res[8]
        d = calc_xy_distance(x, y, res_x, res_y)
        if d<=radius:
            count += 1
    return count

def find_nearest_restaurant(x, y, restaurants):
    d_min = 100000
    for res in restaurants.values():
        res_x = res[7]; res_y = res[8]
        d = calc_xy_distance(x, y, res_x, res_y)
        if d<=d_min:
            d_min = d
    return d_min

roi_restaurant_counts = []
roi_coffee_distances = []

print('Generating data on location candidates... ', end='')
for x, y in zip(roi_xs, roi_ys):
    count = count_restaurants_nearby(x, y, restaurants, radius=250)
    roi_restaurant_counts.append(count)
    distance = find_nearest_restaurant(x, y, coffee_shops)
    roi_coffee_distances.append(distance)
print('done.')


Generating data on location candidates... done.
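
The two helper functions above loop over every venue for every candidate in pure Python. A possible alternative, sketched here with NumPy broadcasting, computes one pairwise distance matrix and derives both the count within a radius and the nearest distance from it. The `venue_stats` function and the sample coordinates are hypothetical. The notebook uses different venue sets for the two quantities (`restaurants` for the count, `coffee_shops` for the distance), so the function would be called once per set and only the relevant result kept.

```python
import numpy as np

def venue_stats(candidates_xy, venues_xy, radius=250.0):
    """Return (number of venues within `radius`, distance to nearest venue)
    for each candidate, using a single pairwise distance matrix."""
    c = np.asarray(candidates_xy, dtype=float)   # shape (n, 2)
    v = np.asarray(venues_xy, dtype=float)       # shape (m, 2)
    diff = c[:, None, :] - v[None, :, :]         # shape (n, m, 2)
    dist = np.sqrt((diff ** 2).sum(axis=2))      # shape (n, m)
    return (dist <= radius).sum(axis=1), dist.min(axis=1)

# Hypothetical sample data in projected meters
candidates = [(0.0, 0.0), (1000.0, 0.0)]
venues = [(100.0, 0.0), (0.0, 200.0), (1000.0, 300.0)]

counts, nearest = venue_stats(candidates, venues)
print(counts, nearest)
```

The distance matrix holds n × m values, so this approach trades memory for speed. It assumes at least one venue is present, because `min` over an empty axis raises an error.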


In [45]:
# Let's put this into dataframe
df_roi_locations = pd.DataFrame({'Latitude':roi_latitudes,
                                 'Longitude':roi_longitudes,
                                 'X':roi_xs,
                                 'Y':roi_ys,
                                 'Drinking shops nearby':roi_restaurant_counts,
                                 'Distance to Coffee shop':roi_coffee_distances})

df_roi_locations.head(10)

Unnamed: 0,Latitude,Longitude,X,Y,Drinking shops nearby,Distance to Coffee shop
0,10.79769,106.672423,15221240.0,10895790.0,0,843.916846
1,10.797507,106.672398,15221340.0,10895790.0,0,846.126415
2,10.798675,106.672721,15220690.0,10895880.0,0,514.80624
3,10.798492,106.672696,15220790.0,10895880.0,0,614.504238
4,10.798309,106.672671,15220890.0,10895880.0,0,714.286674
5,10.798126,106.672646,15220990.0,10895880.0,0,814.122501
6,10.797943,106.672621,15221090.0,10895880.0,0,913.994223
7,10.79776,106.672596,15221190.0,10895880.0,0,933.487247
8,10.797578,106.672571,15221290.0,10895880.0,0,930.125103
9,10.797395,106.672546,15221390.0,10895880.0,0,937.479053


OK. Let us now **filter** those locations: we're interested only in **locations with no more than two drinking shops within a radius of 250 meters**, and **no Coffee shops within a radius of 400 meters**.

In [48]:
good_res_count = np.array((df_roi_locations['Drinking shops nearby']<=2))
print('Locations with no more than two drinking shops nearby:', good_res_count.sum())

good_ita_distance = np.array(df_roi_locations['Distance to Coffee shop']>=400)
print('Locations with no Coffee shops within 400m:', good_ita_distance.sum())

good_locations = np.logical_and(good_res_count, good_ita_distance)
print('Locations with both conditions met:', good_locations.sum())

df_good_locations = df_roi_locations[good_locations]


Locations with no more than two drinking shops nearby: 2238
Locations with no Coffee shops within 400m: 1852
Locations with both conditions met: 1834
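
The same filter can also be written directly on the pandas columns, combining the two boolean Series with `&`. The sketch below uses a hypothetical four-row miniature of `df_roi_locations` with made-up values, purely to show how the two conditions combine. Note that both comparisons are inclusive, so a location with exactly two drinking shops nearby, or with a Coffee shop at exactly 400m, passes.

```python
import pandas as pd

# Hypothetical miniature of df_roi_locations, for illustration only
df = pd.DataFrame({
    'Drinking shops nearby': [0, 3, 1, 2],
    'Distance to Coffee shop': [850.0, 900.0, 120.0, 400.0],
})

few_shops = df['Drinking shops nearby'] <= 2
far_from_coffee = df['Distance to Coffee shop'] >= 400

# Keep only the rows where both conditions hold
good = df[few_shops & far_from_coffee]
print(good.index.tolist())
```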


Looking good. What we have now is a clear indication of zones with a low number of drinking shops in the vicinity and *no* Coffee shops within 400 meters.

Let us now **cluster** those locations to create **centers of zones containing good locations**. Those zones, their centers and addresses will be the final result of our analysis. 

In [53]:
from sklearn.cluster import KMeans

good_latitudes = df_good_locations['Latitude'].values
good_longitudes = df_good_locations['Longitude'].values
number_of_clusters = 15

good_xys = df_good_locations[['X', 'Y']].values
kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(good_xys)

cluster_centers = [xy_to_lonlat(cc[0], cc[1]) for cc in kmeans.cluster_centers_]

map_hcmc = folium.Map(location=roi_center, zoom_start=14)
folium.TileLayer('cartodbpositron').add_to(map_hcmc)
HeatMap(restaurant_latlons).add_to(map_hcmc)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(map_hcmc)
folium.Marker(hcmc_center).add_to(map_hcmc)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=700, color='green', fill=True, fill_opacity=0.25).add_to(map_hcmc) 
for lat, lon in zip(good_latitudes, good_longitudes):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_hcmc)
#map_hcmc

Not bad - our clusters represent groupings of most of the candidate locations and cluster centers are placed nicely in the middle of the zones 'rich' with location candidates.

Addresses of those cluster centers will be a good starting point for exploring the neighborhoods to find the best possible location based on neighborhood specifics.
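
The number of clusters (15) is fixed by hand in the cell above. One common way to sanity-check such a choice is to compare the KMeans inertia (the sum of squared distances to the nearest cluster center) across several values of k. The sketch below does this on synthetic data. The `inertia_by_k` helper and the three-blob data set are assumptions for illustration. On the notebook's near-uniform grid of candidates the curve may not show a clear elbow.

```python
import numpy as np
from sklearn.cluster import KMeans

def inertia_by_k(points, ks, seed=0):
    """Fit KMeans for each k in `ks` and return a {k: inertia} mapping."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(points).inertia_
        for k in ks
    }

# Hypothetical data: three compact, well-separated blobs of 50 points (meters)
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [2000.0, 0.0], [0.0, 2000.0]])
points = np.vstack([c + rng.normal(scale=50.0, size=(50, 2)) for c in blob_centers])

inertias = inertia_by_k(points, ks=[1, 2, 3, 4])
for k, value in inertias.items():
    print(k, round(value))
```

In this synthetic case the inertia drops sharply up to k=3 and changes little afterwards, which is the pattern one would look for when choosing k.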

Let's see those zones on a city map without heatmap, using shaded areas to indicate our clusters:

Finally, let's **reverse geocode those candidate area centers to get the addresses** which can be presented to stakeholders.

In [52]:
candidate_area_addresses = []
print('==============================================================')
print('Addresses of centers of areas recommended for further analysis')
print('==============================================================\n')
for lon, lat in cluster_centers:
    addr = get_address(google_api_key, lat, lon).replace(', Vietnam', '')
    candidate_area_addresses.append(addr)    
    x, y = lonlat_to_xy(lon, lat)
    d = calc_xy_distance(x, y, hcmc_center_x, hcmc_center_y)
    print('{}{} => {:.1f}km from my Center'.format(addr, ' '*(50-len(addr)), d/1000))
    

Addresses of centers of areas recommended for further analysis

223A Trần Huy Liệu, Phường 8, Phú Nhuận, Hồ Chí Minh => 1.5km from my Center
129 Nguyễn Đình Chính, Phường 8, Phú Nhuận, Hồ Chí Minh => 1.9km from my Center
232 Nguyễn Trọng Tuyển, Phường 8, Phú Nhuận, Hồ Chí Minh => 3.5km from my Center
10 Hoàng Văn Thụ, Phường 9, Phú Nhuận, Hồ Chí Minh => 1.1km from my Center
11/12 Nguyễn Trọng Tuyển, Phường 15, Phú Nhuận, Hồ Chí Minh => 0.6km from my Center
251/4 Nguyễn Văn Trỗi, Phường 10, Phú Nhuận, Hồ Chí Minh => 3.6km from my Center
12/18 Chiến Thắng, Phường 9, Phú Nhuận, Hồ Chí Minh => 2.6km from my Center
94/8 Trần Khắc Chân, Phường 9, Phú Nhuận, Hồ Chí Minh => 1.9km from my Center
90A Nguyễn Trọng Tuyển, Phường 15, Phú Nhuận, Hồ Chí Minh => 0.5km from my Center
10/8i Trần Hữu Trang, Phường 11, Phú Nhuận, Hồ Chí Minh => 3.0km from my Center
73/2/3 Duy Tân, Phường 15, Phú Nhuận, Hồ Chí Minh  => 1.3km from my Center
159/39E Hoàng Văn Thụ, Phường 8, Phú Nhuận, Hồ Chí Minh => 2.5km fr
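
The manual padding `' '*(50-len(addr))` in the print call above can also be expressed with a format-spec width, which left-aligns the address in a fixed-width column. This is a small sketch only: the `format_row` helper and the sample address are hypothetical.

```python
def format_row(addr, distance_m, width=50):
    """Left-align `addr` in a column of `width` characters,
    then append the distance converted from meters to kilometers."""
    return '{:<{w}} => {:.1f}km from my Center'.format(
        addr, distance_m / 1000, w=width)

# Hypothetical values for illustration
print(format_row('123 Example Street', 1532.0))
```

As with the original expression, addresses longer than the column width are printed in full without padding.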

In [54]:
map_hcmc = folium.Map(location=roi_center, zoom_start=14)
folium.Circle(hcmc_center, radius=50, color='red', fill=True, fill_color='red', fill_opacity=1).add_to(map_hcmc)
for lonlat, addr in zip(cluster_centers, candidate_area_addresses):
    folium.Marker([lonlat[1], lonlat[0]], popup=addr).add_to(map_hcmc) 
map_hcmc