# Capstone Project - The Battle of the Neighborhoods 
### Applied Data Science Capstone by IBM/Coursera

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

Demand for restaurants and good food is increasing, and not only because the population has grown over the past five years. Eating out has also become a way to break the routine of always eating at home, and an alternative to hosting special guests for dinner at home. Restaurants usually offer a variety of dishes to suit different tastes, and if a restaurant has a very good reputation you would not think twice about bringing special guests there or even holding a business dinner. But where should you, as an investor, consider opening a new restaurant? What factors should you weigh when making this decision? The answer depends on how you position yourself among competitors.

This project aims to find the best place for a new investor to open a restaurant. We assume that the restaurant targets a specific segment of customers, mainly those who plan to host their guests at a high-quality, reputable **Arabic food** restaurant in **Manhattan, New York**. The choice is not about how many people live nearby. It is about making sure that the investor opens the restaurant in a high-income area with well-developed infrastructure, close to one of the big, famous restaurants. You may wonder why anyone would take the risk of opening next to a **very well-known, high-quality restaurant**. The answer is simple, but first: have you asked yourself why KFC and Popeyes restaurants are so often close to each other? These restaurants are not owned by the same owner or group. It is a matter of strategy: Popeyes opens restaurants near KFC to draw on KFC's customer base, which is also its own target audience. You may like KFC, but you will give yourself the chance to try another brand if, for example, you liked its teaser. We will advise the investor to open the restaurant next to one of the other **famous restaurants**, positioning the newcomer among the top names from the start. This still leaves work to do on marketing and advertising, expansion plans, the pricing scheme, the menu and, most importantly, the weaknesses and threats of competitors.

## Data <a name="data"></a>

* List of neighborhoods in Manhattan, New York, USA
* Latitude and longitude coordinates of those neighborhoods, needed to plot the map
* Data on highly reviewed Arabic restaurants in Manhattan, New York, used to cluster the neighborhoods

First, we extract the neighborhoods in New York using web scraping, as we did in **Week #3**. Then we use the Geocoder library to obtain the coordinates of each neighborhood, and the **Foursquare API** to get venue data for each neighborhood. The API returns many kinds of venues, but we are interested only in a specific restaurant category. We also use machine learning techniques such as **k-means clustering**, and **map visualization with Folium**.


### Neighborhood Candidates

Let's create latitude and longitude coordinates for the centroids of our candidate neighborhoods. We will create a grid of cells covering our area of interest, which is approx. 12x12 kilometers centered on the center of Manhattan.

Let's first find the latitude and longitude of the center of Manhattan, using a specific, well-known address and the Google Maps geocoding API.

In [57]:
# The code was removed by Watson Studio for sharing.

In [58]:
import requests

def get_coordinates(api_key, address, verbose=False):
    """Geocode an address; returns [lat, lon], or [None, None] on any failure."""
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json'
        # Passing params lets requests URL-encode the address (spaces, commas)
        response = requests.get(url, params={'key': api_key, 'address': address}).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        geographical_data = results[0]['geometry']['location'] # get geographical coordinates
        lat = geographical_data['lat']
        lon = geographical_data['lng']
        return [lat, lon]
    except Exception:
        return [None, None]
    
address = 'Manhatten, New York, USA'
Manhatten_center = get_coordinates(google_api_key, address)
print('Coordinate of {}: {}'.format(address, Manhatten_center))

Coordinate of Manhatten, New York, USA: [40.7830603, -73.9712488]
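
Separating response parsing from the network call makes the geocoding step easier to test without an API key. The sketch below assumes the response layout that `get_coordinates` already relies on (`results[0]['geometry']['location']`) together with the top-level `status` field of the Geocoding API; `parse_geocode_response` and the sample payloads are illustrative and not part of the original notebook.

```python
def parse_geocode_response(response):
    """Return [lat, lon] from a Geocoding API JSON dict, or [None, None]."""
    # Anything other than status 'OK' (e.g. 'ZERO_RESULTS') is treated as a miss
    if response.get('status') != 'OK':
        return [None, None]
    results = response.get('results') or []
    if not results:
        return [None, None]
    location = results[0]['geometry']['location']
    return [location['lat'], location['lng']]

# Hypothetical payloads shaped like API responses; the coordinates are the ones printed above
sample_ok = {'status': 'OK',
             'results': [{'geometry': {'location': {'lat': 40.7830603, 'lng': -73.9712488}}}]}
sample_empty = {'status': 'ZERO_RESULTS', 'results': []}

print(parse_geocode_response(sample_ok))
print(parse_geocode_response(sample_empty))
```

Checking `status` explicitly distinguishes "no match for this address" from a malformed response, which the bare `try`/`except` in the cell above cannot do.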


Now let's create a grid of equally spaced area candidates, centered on the city center and within ~6 km of the center of Manhattan. Our neighborhoods will be defined as circular areas with a radius of 300 meters, so our neighborhood centers will be 600 meters apart.

To calculate distances accurately we need to create our grid of locations in a 2D Cartesian coordinate system, which lets us calculate distances in meters (not in latitude/longitude degrees). We will then project those coordinates back to latitude/longitude degrees to show them on a Folium map. So let's create functions to convert between the WGS84 geographic coordinate system (latitude/longitude degrees) and the UTM Cartesian coordinate system (X/Y coordinates in meters).

In [59]:
#!pip install shapely
import shapely.geometry

#!pip install pyproj
import pyproj

import math

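# NOTE: zone=33 is the UTM zone for central Europe; Manhattan lies in UTM zone 18.
# The value is kept because every X/Y figure printed in this notebook was produced with it,
# but this far from the zone's central meridian the projected distances are not true meters.
# pyproj.transform is deprecated in newer pyproj releases in favor of pyproj.Transformer.
def lonlat_to_xy(lon, lat):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]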

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
print('Manhatten center longitude={}, latitude={}'.format(Manhatten_center[1], Manhatten_center[0]))
x, y = lonlat_to_xy(Manhatten_center[1], Manhatten_center[0])
print('Manhatten center UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('Manhatten center longitude={}, latitude={}'.format(lo, la))

Coordinate transformation check
-------------------------------
Manhatten center longitude=-73.9712488, latitude=40.7830603
Manhatten center UTM X=-5810246.805659816, Y=9865443.186247082
Manhatten center longitude=-73.97124879999963, latitude=40.78306029999889


Let's create a hexagonal grid of cells: we offset every other row, and adjust the vertical row spacing so that every cell center is equally distant from all its neighbors.

In [60]:
Manhatten_center_x, Manhatten_center_y = lonlat_to_xy(Manhatten_center[1], Manhatten_center[0]) # City center in Cartesian coordinates

k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_min = Manhatten_center_x - 6000
x_step = 600
y_min = Manhatten_center_y - 6000 - (int(21/k)*k*600 - 12000)/2
y_step = 600 * k 

latitudes = []
longitudes = []
distances_from_center = []
xs = []
ys = []
for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = 300 if i%2==0 else 0
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(Manhatten_center_x, Manhatten_center_y, x, y)
        if (distance_from_center <= 6001):
            lon, lat = xy_to_lonlat(x, y)
            latitudes.append(lat)
            longitudes.append(lon)
            distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

print(len(latitudes), 'candidate neighborhood centers generated.')

364 candidate neighborhood centers generated.
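
The equal-spacing property of the hexagonal layout can be checked in isolation. The following is a minimal standalone sketch, not the notebook's cell: it places the center itself on the grid and works in abstract planar units, and `hex_grid` and `nearest_neighbor_distance` are illustrative helper names.

```python
import math

def hex_grid(center_x, center_y, radius, step):
    """Return (x, y) points of a hexagonal grid within `radius` of the center."""
    k = math.sqrt(3) / 2                 # row spacing factor for a hexagonal layout
    row_step = step * k
    n_rows = int(radius / row_step)
    n_cols = int(radius / step) + 1
    points = []
    for i in range(-n_rows, n_rows + 1):
        y = center_y + i * row_step
        x_offset = step / 2 if i % 2 else 0.0   # shift every other row by half a step
        for j in range(-n_cols, n_cols + 1):
            x = center_x + j * step + x_offset
            if math.hypot(x - center_x, y - center_y) <= radius:
                points.append((x, y))
    return points

def nearest_neighbor_distance(point, points):
    """Smallest distance from `point` to any other point in `points`."""
    return min(math.hypot(point[0] - q[0], point[1] - q[1])
               for q in points if q != point)

grid = hex_grid(0.0, 0.0, radius=6000, step=600)
spacings = [nearest_neighbor_distance(p, grid) for p in grid]
print(len(grid), 'points; nearest-neighbor spacing between',
      round(min(spacings), 6), 'and', round(max(spacings), 6))
```

Because rows are `step * sqrt(3) / 2` apart and alternate rows are shifted by half a step, each point's six neighbors sit at exactly `step`, which is what lets circles of radius `step / 2` tile the area without gaps larger than the hexagon corners.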


In [61]:
#!pip install folium
import folium
map_Manhatten = folium.Map(location=Manhatten_center, zoom_start=13)
folium.Marker(Manhatten_center, popup='Manhatten').add_to(map_Manhatten)
for lat, lon in zip(latitudes, longitudes):
    #folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_Manhatten) 
    folium.Circle([lat, lon], radius=300, color='blue', fill=False).add_to(map_Manhatten)
    #folium.Marker([lat, lon]).add_to(map_Manhatten)
map_Manhatten

OK, we now have the coordinates of the centers of the neighborhoods/areas to be evaluated, equally spaced (the distance from every point to its neighbors is exactly the same) and within ~6 km of the center of Manhattan.

Let's now use the Google Maps API to get approximate addresses of those locations.

In [62]:
def get_address(api_key, latitude, longitude, verbose=False):
    """Reverse-geocode a point; returns the formatted address, or None on any failure."""
    try:
        url = 'https://maps.googleapis.com/maps/api/geocode/json'
        params = {'key': api_key, 'latlng': '{},{}'.format(latitude, longitude)}
        response = requests.get(url, params=params).json()
        if verbose:
            print('Google Maps API JSON result =>', response)
        results = response['results']
        address = results[0]['formatted_address']
        return address
    except Exception:
        return None

addr = get_address(google_api_key, Manhatten_center[0], Manhatten_center[1])
print('Reverse geocoding check')
print('-----------------------')
print('Address of [{}, {}] is: {}'.format(Manhatten_center[0], Manhatten_center[1], addr))

Reverse geocoding check
-----------------------
Address of [40.7830603, -73.9712488] is: 225 Central Park West, New York, NY 10024, USA


In [63]:
print('Obtaining location addresses: ', end='')
addresses = []
for lat, lon in zip(latitudes, longitudes):
    address = get_address(google_api_key, lat, lon)
    if address is None:
        address = 'NO ADDRESS'
    address = address.replace(', USA', '') # We don't need country part of address
    addresses.append(address)
    print(' .', end='')
print(' done.')

Obtaining location addresses:  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . done.


In [13]:
addresses[150:170]

['3000 Broadway, New York, NY 10027',
 'Sakura Park, 3916, 500 Riverside Dr, New York, NY 10027',
 'Riverside Dr/Tiemann Pl, New York, NY 10027',
 '817 1st Avenue, New York, NY 10017',
 '304 E 50th St, New York, NY 10022',
 '909 3rd Ave, New York, NY 10022',
 '145 E 58th St, New York, NY 10022',
 '575 Park Ave, New York, NY 10065',
 '796 Madison Ave, New York, NY 10065',
 '803 Terrace Dr, New York, NY 10021',
 'East Dr, New York, NY 10024',
 '52 79th St Transverse, New York, NY 10024',
 'Great Lawn Oval, New York, NY 10024',
 'Central Pk W/W 88 St, New York, NY 10024',
 '10 W 93rd St, New York, NY 10025',
 '775 Columbus Ave, New York, NY 10025',
 '140 W 102nd St, New York, NY 10025',
 '206 Duke Ellington Blvd, New York, NY 10025',
 '2840 Broadway, New York, NY 10025',
 '425 Riverside Dr, New York, NY 10025']

In [64]:
#Let's now place all this into a Pandas dataframe.
import pandas as pd

df_locations = pd.DataFrame({'Address': addresses,
                             'Latitude': latitudes,
                             'Longitude': longitudes,
                             'X': xs,
                             'Y': ys,
                             'Distance from center': distances_from_center})

df_locations.head(10)

Unnamed: 0,Address,Latitude,Longitude,X,Y,Distance from center
0,"24-43 28th St, Astoria, NY 11102",40.771502,-73.927293,-5812047.0,9859727.0,5992.495307
1,"25-33 14th Pl, Long Island City, NY 11102",40.775041,-73.92716,-5811447.0,9859727.0,5840.3767
2,"I-278, Astoria, NY 11102",40.778579,-73.927028,-5810847.0,9859727.0,5747.173218
3,"40.7816884 -73.9269238, Wards Meadow Loop, New...",40.782118,-73.926895,-5810247.0,9859727.0,5715.767665
4,"125 Hell Gate Cir, New York, NY 10035",40.785657,-73.926762,-5809647.0,9859727.0,5747.173218
5,"Main Rdwy/ Manhattan Psyc Ctr, New York, NY 10035",40.789197,-73.926629,-5809047.0,9859727.0,5840.3767
6,"20 Randalls Is Rd, New York, NY 10035",40.792736,-73.926496,-5808447.0,9859727.0,5992.495307
7,"14-44 31st Dr, Long Island City, NY 11106",40.766282,-73.931522,-5812947.0,9860247.0,5855.766389
8,"30-56 14th St, Long Island City, NY 11102",40.76982,-73.93139,-5812347.0,9860247.0,5604.462508
9,"27-16 28th Ave, Long Island City, NY 11102",40.773359,-73.931258,-5811747.0,9860247.0,5408.326913
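
Projected distances are only as good as the projection, so it can be worth cross-checking one grid step against a great-circle distance computed directly from latitude and longitude. The sketch below applies the haversine formula, assuming a mean Earth radius of 6,371 km, to the first two rows of the table above, which are adjacent grid centers; `haversine_m` is an illustrative helper, not part of the original notebook.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, earth_radius_m=6371000.0):
    """Great-circle distance in meters between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))

# Latitude/longitude of rows 0 and 1 of the table above (adjacent grid centers)
d = haversine_m(40.771502, -73.927293, 40.775041, -73.92716)
print('Great-circle distance between adjacent centers: {:.1f} m'.format(d))
```

If the printed value is far from the intended 600 m step, the UTM zone passed to `pyproj` is the first thing to check: Manhattan lies in UTM zone 18, whereas zone 33 covers central Europe.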


In [65]:
#saving the results into local file
df_locations.to_pickle('./locations.pkl')  

### Foursquare
Now that we have our location candidates, let's use the Foursquare API to get info on restaurants in each neighborhood.

We're interested in venues in the 'food' category, but only those that are proper restaurants: coffee shops, pizza places, bakeries etc. are not direct competitors, so we don't care about those. We will therefore include in our list only venues that have 'restaurant' in the category name, and we'll make sure to detect and include all the subcategories of the specific 'Arabic restaurant' category, as we need info on Arabic restaurants in the neighborhood.

In [66]:
# FourSquare Credentials 
foursquare_client_id = 'VXNURCRV1MDFLMOHBSGVDL5AU1B0UQBO2MQI2SBKN5CNSJ5Z'
foursquare_client_secret = '3AVGSTEBBSU4R2QYAFXUUTZAUVUE4K3RIAMLXASDUOXWGH52'
VERSION = '20180605'
#LIMIT = 10
#radius = 5
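
Credentials are easier to rotate and safer to share when they are read from the environment rather than written into a cell. The following is a minimal sketch under that assumption: the environment variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helper `build_explore_url` are hypothetical, while the endpoint and query parameters are the ones `get_venues_near_location` uses in this notebook.

```python
import os
from urllib.parse import urlencode

EXPLORE_URL = 'https://api.foursquare.com/v2/venues/explore'

def build_explore_url(lat, lon, category_id, client_id=None, client_secret=None,
                      version='20180724', radius=500, limit=100):
    """Build the explore request URL; credentials default to environment variables."""
    params = {
        # Hypothetical variable names; set them in the shell before starting the notebook
        'client_id': client_id or os.environ.get('FOURSQUARE_CLIENT_ID', ''),
        'client_secret': client_secret or os.environ.get('FOURSQUARE_CLIENT_SECRET', ''),
        'v': version,
        'll': '{},{}'.format(lat, lon),
        'categoryId': category_id,
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes reserved characters such as the comma in `ll`
    return EXPLORE_URL + '?' + urlencode(params)

example_url = build_explore_url(40.7830603, -73.9712488, '4d4b7105d754a06374d81259',
                                client_id='my-id', client_secret='my-secret')
print(example_url)
```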

In [67]:
# Category IDs corresponding to Arabic restaurants were taken from Foursquare web site (https://developer.foursquare.com/docs/resources/categories):

food_category = '4d4b7105d754a06374d81259' # 'Root' category for all food-related venues

arabic_restaurant_categories = ['52e81612bcbc57f1066b79ff']

def is_restaurant(categories, specific_filter=None):
    restaurant_words = ['restaurant', 'diner', 'taverna', 'steakhouse','arabic','halal']
    restaurant = False
    specific = False
    for c in categories:
        category_name = c[0].lower()
        category_id = c[1]
        for r in restaurant_words:
            if r in category_name:
                restaurant = True
        if 'fast food' in category_name:
            restaurant = False
        if not(specific_filter is None) and (category_id in specific_filter):
            specific = True
            restaurant = True
    return restaurant, specific

def get_categories(categories):
    return [(cat['name'], cat['id']) for cat in categories]

def format_address(location):
    address = ', '.join(location['formattedAddress'])
    address = address.replace(', Deutschland', '')
    address = address.replace(', Germany', '')
    return address

def get_venues_near_location(lat, lon, category, client_id, client_secret, radius=500, limit=100):
    version = '20180724'
    url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&categoryId={}&radius={}&limit={}'.format(
        client_id, client_secret, version, lat, lon, category, radius, limit)
    try:
        results = requests.get(url).json()['response']['groups'][0]['items']
        venues = [(item['venue']['id'],
                   item['venue']['name'],
                   get_categories(item['venue']['categories']),
                   (item['venue']['location']['lat'], item['venue']['location']['lng']),
                   format_address(item['venue']['location']),
                   item['venue']['location']['distance']) for item in results]        
    except:
        venues = []
    return venues
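
The category filter is easiest to reason about on a few hand-made inputs. The sketch below is a simplified variant of `is_restaurant`, not a copy of it: here a 'fast food' category vetoes the keyword match regardless of category order, while a match on a specific category ID always wins. The function name `classify`, the sample category names and the ID `'arabic-id'` are made up for illustration.

```python
RESTAURANT_WORDS = ('restaurant', 'diner', 'taverna', 'steakhouse', 'arabic', 'halal')

def classify(categories, specific_ids=()):
    """categories is a list of (name, id) tuples; returns (is_restaurant, is_specific)."""
    names = [name.lower() for name, _ in categories]
    is_specific = any(cat_id in specific_ids for _, cat_id in categories)
    has_keyword = any(word in name for name in names for word in RESTAURANT_WORDS)
    is_fast_food = any('fast food' in name for name in names)
    return is_specific or (has_keyword and not is_fast_food), is_specific

# Hypothetical venues, each with a single category
samples = {
    'trattoria': [('Italian Restaurant', 'id-1')],
    'burger chain': [('Fast Food Restaurant', 'id-2')],
    'cafe': [('Coffee Shop', 'id-3')],
    'mezze place': [('Middle Eastern Restaurant', 'arabic-id')],
}
for venue, cats in samples.items():
    print(venue, classify(cats, specific_ids=['arabic-id']))
```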

In [68]:
# Let's now go over our neighborhood locations and get nearby restaurants; we'll also maintain a dictionary of all found restaurants and all found arabic restaurants

import pickle

def get_restaurants(lats, lons):
    restaurants = {}
    arabic_restaurants = {}
    location_restaurants = []

    print('Obtaining venues around candidate locations:', end='')
    for lat, lon in zip(lats, lons):
        # Using radius=350 to make sure we have overlaps/full coverage so we don't miss any restaurant (we're using dictionaries to remove any duplicates resulting from area overlaps)
        venues = get_venues_near_location(lat, lon, food_category, foursquare_client_id, foursquare_client_secret, radius=350, limit=100)
        area_restaurants = []
        for venue in venues:
            venue_id = venue[0]
            venue_name = venue[1]
            venue_categories = venue[2]
            venue_latlon = venue[3]
            venue_address = venue[4]
            venue_distance = venue[5]
            is_res, is_arabic = is_restaurant(venue_categories, specific_filter=arabic_restaurant_categories)
            if is_res:
                x, y = lonlat_to_xy(venue_latlon[1], venue_latlon[0])
                restaurant = (venue_id, venue_name, venue_latlon[0], venue_latlon[1], venue_address, venue_distance, is_arabic, x, y)
                if venue_distance<=300:
                    area_restaurants.append(restaurant)
                restaurants[venue_id] = restaurant
                if is_arabic:
                    arabic_restaurants[venue_id] = restaurant
        location_restaurants.append(area_restaurants)
        print(' .', end='')
    print(' done.')
    return restaurants, arabic_restaurants, location_restaurants

# Try to load from local file system in case we did this before
restaurants = {}
arabic_restaurants = {}
location_restaurants = []
loaded = False
try:
    with open('restaurants_350.pkl', 'rb') as f:
        restaurants = pickle.load(f)
    with open('arabic_restaurants_350.pkl', 'rb') as f:
        arabic_restaurants = pickle.load(f)
    with open('location_restaurants_350.pkl', 'rb') as f:
        location_restaurants = pickle.load(f)
    print('Restaurant data loaded.')
    loaded = True
except:
    pass

# If load failed use the Foursquare API to get the data
if not loaded:
    restaurants, arabic_restaurants, location_restaurants = get_restaurants(latitudes, longitudes)
    
    # Let's persist this in the local file system
    with open('restaurants_350.pkl', 'wb') as f:
        pickle.dump(restaurants, f)
    with open('arabic_restaurants_350.pkl', 'wb') as f:
        pickle.dump(arabic_restaurants, f)
    with open('location_restaurants_350.pkl', 'wb') as f:
        pickle.dump(location_restaurants, f)
        

Restaurant data loaded.


In [28]:
import numpy as np

print('Total number of restaurants:', len(restaurants))
print('Total number of Arabic restaurants:', len(arabic_restaurants))
print('Percentage of Arabic restaurants: {:.2f}%'.format(len(arabic_restaurants) / len(restaurants) * 100))
print('Average number of restaurants in neighborhood:', np.array([len(r) for r in location_restaurants]).mean())

Total number of restaurants: 1786
Total number of Arabic restaurants: 4
Percentage of Arabic restaurants: 0.22%
Average number of restaurants in neighborhood: 9.085164835164836


In [29]:
print('List of all restaurants')
print('-----------------------')
for r in list(restaurants.values())[:10]:
    print(r)
print('...')
print('Total:', len(restaurants))

List of all restaurants
-----------------------
('49c68eaaf964a5205c571fe3', 'Vesta Trattoria & Wine Bar', 40.76980934497303, -73.9277960938928, '21-02 30th Ave (at 21st St), Astoria, NY 11102, United States', 302, False, -5812335.5138223935, 9859784.009561414)
('597cdab7bd40092a30c3f2bc', 'Astoria Provisions', 40.77218710579064, -73.92895119263953, '12-23 Astoria Blvd (at 14th St), Long Island City, NY 11102, United States', 234, False, -5811936.871087021, 9859944.285675734)
('4ab6d166f964a5202f7920e3', 'Roti Boti', 40.77198140414186, -73.92612563916222, '2709 21st St (btw Astoria Blvd & 27th Rd), Astoria, NY 11102, United States', 111, False, -5811961.344086435, 9859579.330063501)
('502e6f61e4b0eed9c3113816', 'El Ancla', 40.77097725983302, -73.92705823255376, '28-08 21st St, Astoria, NY 11102, United States', 61, False, -5812134.911398285, 9859694.603966929)
('4f345f48e4b03a18765668a9', 'La Herradura', 40.771986322122, -73.92548524462315, '2109 Astoria Blvd (Astoria Blvd and Newtown)

In [30]:
print('List of Arabic restaurants')
print('---------------------------')
for r in list(arabic_restaurants.values())[:10]:
    print(r)
print('...')
print('Total:', len(arabic_restaurants))

List of Arabic restaurants
---------------------------
('4bc905a60687ef3bac39d9cc', 'Halal Kitchen', 40.793135, -73.940916, '2135 2nd Ave (110 n 109), New York, NY 10029, United States', 284, True, -5808431.91845712, 9861585.409070749)
('545234a5498e4c349551762e', 'The Halal Guys', 40.79351975535894, -73.97088852812787, '720 Amsterdam Ave (95th), New York, NY 10025, United States', 339, True, -5808473.825832567, 9865445.264228817)
('4a19d1edf964a5205a7a1fe3', 'The Halal Guys', 40.763826108090846, -73.98018212354839, 'W 53rd St (at 6th Ave), New York, NY 10019, United States', 16, True, -5813537.171902473, 9866505.137764478)
('57ae0d51498e266840e587f8', 'Platter King', 40.800363, -74.007485, '7704 Bergenline Ave (77th Street), North Bergen, NJ 07047, United States', 304, True, -5807441.381061337, 9870186.661837654)
...
Total: 4


In [31]:
print('Restaurants around location')
print('---------------------------')
for i in range(100, 110):
    rs = location_restaurants[i][:8]
    names = ', '.join([r[1] for r in rs])
    print('Restaurants around location {}: {}'.format(i+1, names))

Restaurants around location
---------------------------
Restaurants around location 101: Kome Waza UES, Tanoshi Sushi, Sushi Ishikawa, Chipotle Mexican Grill, Campagnola Restaurant, Bohemian Spirit Restaurant, Sophie's Cuban Cuisine, Delizia 73 Ristorante & Pizza
Restaurants around location 102: Up Thai, Boqueria, Uva, THEP Thai Restaurant, The Meatball Shop, Jones Wood Foundry, Heidi's House By The Side Of The Road, Sushi Ishikawa
Restaurants around location 103: Heidi's House By The Side Of The Road, Uva, Luke's Lobster, San Matteo Pizzeria e Cucina, Caffe Buon Gusto - Manhattan, Pil Pil, Calexico, Quality Eats
Restaurants around location 104: Flex Mussels, Erminia Ristorante, Toloache 82, Elio's, Dulce Vida Latin Bistro, Antonucci, The Simone, Beyoglu
Restaurants around location 105: Dig Inn, Lex Restaurant, Ooki Sushi, Naruto Ramen, Wok 88, Guzan Sushi & Bar, Lolita's Kitchen, Peri Ela
Restaurants around location 106: Paola's Restaurant, Russ & Daughters, Table d'Hote, Sfoglia, Lex

In [69]:
map_Manhatten = folium.Map(location=Manhatten_center, zoom_start=13)
folium.Marker(Manhatten_center, popup='Manhatten').add_to(map_Manhatten)
for res in restaurants.values():
    lat = res[2]; lon = res[3]
    is_arabic = res[6]
    color = 'red' if is_arabic else 'blue'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_Manhatten)
map_Manhatten

Looking good. So now we have all the restaurants within a few kilometers of the center of Manhattan, and we know which ones are Arabic restaurants! We also know exactly which restaurants are in the vicinity of every neighborhood candidate center.

This concludes the data gathering phase - we're now ready to use this data for analysis to produce the report on optimal locations for a new Arabic restaurant!

## Methodology  <a name="methodology"></a>


In the first step we collected the required **data: location and type (category) of every restaurant within 6 km of the center of Manhattan** (New York). We also **identified Arabic restaurants** (according to Foursquare categorization).

The second step in our analysis is the calculation and exploration of '**restaurant density**' across different areas of Manhattan - we will use **heatmaps** to identify a few promising areas close to the center and focus our attention on those areas.

In the third and final step we focus on the most promising areas and, within those, create **clusters of locations that meet some basic requirements** established in discussion with stakeholders. We will present a map of all such locations and also cluster them (using **k-means clustering**) to identify general zones / neighborhoods / addresses that should be the starting point for the final 'street level' exploration and search for the optimal venue location by stakeholders.

## Analysis <a name="analysis"></a>
Let's perform some basic exploratory data analysis and derive some additional info from our raw data. First, let's count the number of restaurants in every area candidate:

In [37]:
location_restaurants_count = [len(res) for res in location_restaurants]

df_locations['Restaurants in area'] = location_restaurants_count

print('Average number of restaurants in every area with radius=300m:', np.array(location_restaurants_count).mean())

df_locations.head(10)

Average number of restaurants in every area with radius=300m: 9.085164835164836


Unnamed: 0,Address,Latitude,Longitude,X,Y,Distance from center,Restaurants in area
0,"24-43 28th St, Astoria, NY 11102",40.771502,-73.927293,-5812047.0,9859727.0,5992.495307,7
1,"25-33 14th Pl, Long Island City, NY 11102",40.775041,-73.92716,-5811447.0,9859727.0,5840.3767,2
2,"I-278, Astoria, NY 11102",40.778579,-73.927028,-5810847.0,9859727.0,5747.173218,0
3,"40.7816884 -73.9269238, Wards Meadow Loop, New...",40.782118,-73.926895,-5810247.0,9859727.0,5715.767665,0
4,"125 Hell Gate Cir, New York, NY 10035",40.785657,-73.926762,-5809647.0,9859727.0,5747.173218,0
5,"Main Rdwy/ Manhattan Psyc Ctr, New York, NY 10035",40.789197,-73.926629,-5809047.0,9859727.0,5840.3767,0
6,"20 Randalls Is Rd, New York, NY 10035",40.792736,-73.926496,-5808447.0,9859727.0,5992.495307,2
7,"14-44 31st Dr, Long Island City, NY 11106",40.766282,-73.931522,-5812947.0,9860247.0,5855.766389,8
8,"30-56 14th St, Long Island City, NY 11102",40.76982,-73.93139,-5812347.0,9860247.0,5604.462508,2
9,"27-16 28th Ave, Long Island City, NY 11102",40.773359,-73.931258,-5811747.0,9860247.0,5408.326913,2


OK, now let's calculate the distance to the nearest Arabic restaurant from every area candidate center (not only those within 300 m - we want the distance to the closest one, regardless of how distant it is).

In [56]:
distances_to_arabic_restaurant = []

for area_x, area_y in zip(xs, ys):
    min_distance = 10000
    for res in arabic_restaurants.values():
        res_x = res[7]
        res_y = res[8]
        d = calc_xy_distance(area_x, area_y, res_x, res_y)
        if d<min_distance:
            min_distance = d
    distances_to_arabic_restaurant.append(min_distance)

df_locations['Distance to Arabic restaurant'] = distances_to_arabic_restaurant
df_locations.head(10)

Unnamed: 0,Address,Latitude,Longitude,X,Y,Distance from center,Restaurants in area,Distance to Arabic restaurant
0,"24-43 28th St, Astoria, NY 11102",40.771502,-73.927293,-5812047.0,9859727.0,5992.495307,7,4064.42347
1,"25-33 14th Pl, Long Island City, NY 11102",40.775041,-73.92716,-5811447.0,9859727.0,5840.3767,2,3541.422525
2,"I-278, Astoria, NY 11102",40.778579,-73.927028,-5810847.0,9859727.0,5747.173218,0,3046.934338
3,"40.7816884 -73.9269238, Wards Meadow Loop, New...",40.782118,-73.926895,-5810247.0,9859727.0,5715.767665,0,2597.295558
4,"125 Hell Gate Cir, New York, NY 10035",40.785657,-73.926762,-5809647.0,9859727.0,5747.173218,0,2219.92783
5,"Main Rdwy/ Manhattan Psyc Ctr, New York, NY 10035",40.789197,-73.926629,-5809047.0,9859727.0,5840.3767,0,1957.09349
6,"20 Randalls Is Rd, New York, NY 10035",40.792736,-73.926496,-5808447.0,9859727.0,5992.495307,2,1858.05013
7,"14-44 31st Dr, Long Island City, NY 11106",40.766282,-73.931522,-5812947.0,9860247.0,5855.766389,8,4709.082156
8,"30-56 14th St, Long Island City, NY 11102",40.76982,-73.93139,-5812347.0,9860247.0,5604.462508,2,4137.340947
9,"27-16 28th Ave, Long Island City, NY 11102",40.773359,-73.931258,-5811747.0,9860247.0,5408.326913,2,3574.874189
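
The nearest-restaurant search can also be written without a sentinel value such as `10000`, which would silently cap any larger distance. The sketch below is one way to do it with `min` and a `default`; `nearest_distance` is an illustrative helper, and the coordinates are the X/Y values of the four Arabic restaurants and of the first area center as printed earlier, rounded.

```python
import math

def nearest_distance(x, y, targets):
    """Distance from (x, y) to the closest (tx, ty) in targets; inf if targets is empty."""
    return min((math.hypot(tx - x, ty - y) for tx, ty in targets), default=math.inf)

# Rounded X/Y of the four Arabic restaurants listed earlier
arabic_xy = [(-5808431.9, 9861585.4),
             (-5808473.8, 9865445.3),
             (-5813537.2, 9866505.1),
             (-5807441.4, 9870186.7)]

# Rounded X/Y of the first area center in the table above
d = nearest_distance(-5812047.0, 9859727.0, arabic_xy)
print('Distance to nearest Arabic restaurant: {:.1f}'.format(d))
```

Returning `math.inf` for an empty list makes the "no Arabic restaurant found at all" case explicit instead of reporting the sentinel as if it were a measured distance.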


In [40]:
print('Average distance to closest Arabic restaurant from each area center:', df_locations['Distance to Arabic restaurant'].mean())

Average distance to closest Arabic restaurant from each area center: 2349.9668585471973


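This is a very promising result: on average, the closest Arabic restaurant is more than 2 km from an area center, so most candidate areas have no Arabic competitor nearby.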

In [46]:
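# Region of interest (ROI): a circle of radius 2500 around roi_center.
# The cell that defines the ROI is not shown; the offsets below are an assumption
# inferred from the X/Y values in the output of this cell.
roi_center_x = Manhatten_center_x + 500
roi_center_y = Manhatten_center_y - 1500
roi_x_min = roi_center_x - 2500

k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
x_step = 100
y_step = 100 * k 
roi_y_min = roi_center_y - 2500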

roi_latitudes = []
roi_longitudes = []
roi_xs = []
roi_ys = []
for i in range(0, int(51/k)):
    y = roi_y_min + i * y_step
    x_offset = 50 if i%2==0 else 0
    for j in range(0, 51):
        x = roi_x_min + j * x_step + x_offset
        d = calc_xy_distance(roi_center_x, roi_center_y, x, y)
        if (d <= 2501):
            lon, lat = xy_to_lonlat(x, y)
            roi_latitudes.append(lat)
            roi_longitudes.append(lon)
            roi_xs.append(x)
            roi_ys.append(y)

print(len(roi_latitudes), 'candidate neighborhood centers generated.')

def count_restaurants_nearby(x, y, restaurants, radius=250):    
    count = 0
    for res in restaurants.values():
        res_x = res[7]; res_y = res[8]
        d = calc_xy_distance(x, y, res_x, res_y)
        if d<=radius:
            count += 1
    return count

def find_nearest_restaurant(x, y, restaurants):
    d_min = 100000
    for res in restaurants.values():
        res_x = res[7]; res_y = res[8]
        d = calc_xy_distance(x, y, res_x, res_y)
        if d<=d_min:
            d_min = d
    return d_min

roi_restaurant_counts = []
roi_arabic_distances = []

print('Generating data on location candidates... ', end='')
for x, y in zip(roi_xs, roi_ys):
    count = count_restaurants_nearby(x, y, restaurants, radius=250)
    roi_restaurant_counts.append(count)
    distance = find_nearest_restaurant(x, y, arabic_restaurants)
    roi_arabic_distances.append(distance)
print('done.')


# Let's put this into dataframe
df_roi_locations = pd.DataFrame({'Latitude':roi_latitudes,
                                 'Longitude':roi_longitudes,
                                 'X':roi_xs,
                                 'Y':roi_ys,
                                 'Restaurants nearby':roi_restaurant_counts,
                                 'Distance to Arabic restaurant':roi_arabic_distances})

df_roi_locations.head(10)

2261 candidate neighborhood centers generated.
Generating data on location candidates... done.


Unnamed: 0,Latitude,Longitude,X,Y,Restaurants nearby,Distance to Arabic restaurant
0,40.78506,-73.94011,-5809797.0,9861443.0,0,1372.277089
1,40.785649,-73.940089,-5809697.0,9861443.0,0,1272.857795
2,40.78183,-73.940903,-5810347.0,9861530.0,0,1915.694813
3,40.78242,-73.940881,-5810247.0,9861530.0,0,1815.739291
4,40.783009,-73.940859,-5810147.0,9861530.0,0,1715.788954
5,40.783599,-73.940837,-5810047.0,9861530.0,0,1615.844762
6,40.784189,-73.940815,-5809947.0,9861530.0,0,1515.90793
7,40.784779,-73.940793,-5809847.0,9861530.0,0,1415.980018
8,40.785369,-73.940772,-5809747.0,9861530.0,0,1316.063058
9,40.785959,-73.94075,-5809647.0,9861530.0,0,1216.159747


In [47]:
from sklearn.cluster import KMeans
from folium.plugins import HeatMap

number_of_clusters = 15

good_xys = df_roi_locations[['X', 'Y']].values
kmeans = KMeans(n_clusters=number_of_clusters, random_state=0).fit(good_xys)

cluster_centers = [xy_to_lonlat(cc[0], cc[1]) for cc in kmeans.cluster_centers_]

# The map below needs these names, which no cell shown here defines; the definitions are assumptions
roi_center_lon, roi_center_lat = xy_to_lonlat(roi_center_x, roi_center_y)
roi_center = [roi_center_lat, roi_center_lon]
restaurant_latlons = [[res[2], res[3]] for res in restaurants.values()]

map_Manhatten = folium.Map(location=roi_center, zoom_start=14)
folium.TileLayer('cartodbpositron').add_to(map_Manhatten)
HeatMap(restaurant_latlons).add_to(map_Manhatten)
folium.Circle(roi_center, radius=2500, color='white', fill=True, fill_opacity=0.4).add_to(map_Manhatten)
folium.Marker(Manhatten_center).add_to(map_Manhatten)
for lon, lat in cluster_centers:
    folium.Circle([lat, lon], radius=500, color='green', fill=True, fill_opacity=0.25).add_to(map_Manhatten) 
# Plot the clustered candidate locations (the points in df_roi_locations)
for lat, lon in zip(df_roi_locations['Latitude'], df_roi_locations['Longitude']):
    folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_Manhatten)
map_Manhatten
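
To make the clustering step less of a black box, here is a minimal sketch of the assign-then-update loop (Lloyd's algorithm) that k-means is built on. It is not scikit-learn's implementation, which by default uses a more careful initialization; the function `kmeans_2d` and the toy points are illustrative only.

```python
import math
import random

def kmeans_2d(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm on 2D points; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its closest center
        labels = [min(range(k),
                      key=lambda c: math.hypot(p[0] - centers[c][0], p[1] - centers[c][1]))
                  for p in points]
        # Update step: move each center to the mean of its members
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centers.append((sum(x for x, _ in members) / len(members),
                                    sum(y for _, y in members) / len(members)))
            else:
                new_centers.append(centers[c])   # keep an empty cluster's center in place
        if new_centers == centers:               # converged
            break
        centers = new_centers
    return centers, labels

# Two well-separated toy groups of candidate locations (made-up coordinates)
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (100, 100), (100, 101), (101, 100), (101, 101)]
toy_centers, toy_labels = kmeans_2d(toy_points, k=2)
print(sorted(toy_centers))
```

In the notebook the same idea is applied to the projected X/Y coordinates, so each cluster center is the mean position of the candidate locations assigned to it.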

## Results and Discussion <a name="results"></a>

Our analysis shows that although there is a large number of restaurants in Manhattan, New York (1,786 in the area we surveyed), Arabic restaurants are very few (only 4, or 0.22%). This scarcity is encouraging for the new business.
The location candidates were then clustered to create zones of interest that contain the greatest number of location candidates. Addresses of the centers of those zones were also generated using reverse geocoding, to be used as markers/starting points for more detailed local analysis based on other factors.
The result of all this is 15 zones containing the largest number of potential new restaurant locations, based on the number of and distance to existing venues - both restaurants in general and Arabic restaurants in particular. This, of course, does not imply that those zones are optimal locations for a new restaurant! The purpose of this analysis was only to provide info on areas close to the center of Manhattan that are not crowded with existing restaurants (particularly Arabic ones). It is entirely possible that there is a very good reason for the small number of restaurants in any of those areas, a reason which would make them unsuitable for a new restaurant regardless of the lack of competition. The recommended zones should therefore be considered only as a starting point for a more detailed analysis, which could eventually result in a location that not only has no nearby competition but also meets all other relevant conditions.


## Conclusion <a name="conclusion"></a>
The purpose of this project was to identify areas of Manhattan close to the center with a low number of restaurants (particularly Arabic restaurants), in order to help stakeholders narrow down the search for the optimal location for a new Arabic restaurant. Those locations were then clustered to create major zones of interest (containing the greatest number of potential locations), and addresses of the zone centers were created to be used as starting points for final exploration by stakeholders.

The final decision on the optimal restaurant location will be made by stakeholders based on the specific characteristics of neighborhoods and locations in every recommended zone, taking into consideration additional factors such as the attractiveness of each location (proximity to a park or water), noise levels / proximity to major roads, real estate availability, prices, and the social and economic dynamics of every neighborhood.