# Capstone Project
#### Applied Data Science Capstone by IBM/Coursera


### Table of contents
* [Introduction](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)


# - Introduction: Business Problem <a name="introduction"></a>


In this project we will try to find an optimal location for a gym. Specifically, this report is targeted at stakeholders interested in opening a **gym** in **New York City**, USA.

Since there are many gyms in New York, we will try to detect **locations that are not already crowded with businesses**. We are also particularly interested in **areas with no gyms in the vicinity**. We would also prefer locations **as close to the city center as possible**, assuming that the first two conditions are met.

We will use data science techniques to identify a few of the most promising neighborhoods based on these criteria. The advantages of each area will then be clearly expressed so that stakeholders can choose the best possible final location.

# - Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* number of existing businesses in the neighborhood (any type of business)
* number of and distance to gyms in the neighborhood, if any
* distance of the neighborhood from the city center

We decided to use a regularly spaced grid of locations, centered around the city center, to define our neighborhoods.

The following data sources will be needed to extract/generate the required information:
* centers of candidate areas will be chosen on a regular grid, and approximate addresses around those centers will be obtained using the **Foursquare API**
* the number of gyms and their type and location in every neighborhood will be obtained using the **Foursquare API**
* the coordinates of the New York center will be obtained by geocoding a well-known New York location (the code below uses the **Nominatim** geocoder from geopy)

### Neighborhood Candidates

Let's create latitude & longitude coordinates for the centroids of our candidate neighborhoods. We will create a grid of cells covering our area of interest, which is approx. 12x12 kilometers centered around the New York city center.

Let's first find the latitude & longitude of the New York city center, using a specific, well-known address and the Nominatim geocoder from geopy.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
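try:
    from pandas import json_normalize # transform JSON file into a pandas dataframe (pandas >= 1.0)
except ImportError:
    from pandas.io.json import json_normalize # fallback for older pandas versions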

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
import folium # map rendering library

# from IPython.display import HTML, display

### By choice, we use the location of Highland Park, New York, USA

In [2]:
address = 'Highland Park, New York, USA'

geolocator = Nominatim(user_agent="to_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
ny_center = [latitude, longitude]
print(ny_center)
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

[40.6871747, -73.88774612903762]
The geographical coordinates of New York City are 40.6871747, -73.88774612903762.


In [3]:
# address = '58th St. and Queens Blvd, New York, USA'
# https://www.atlasobscura.com/places/geographic-center-of-new-york-city

Now let's create a grid of candidate areas, equally spaced, centered around the city center and within ~6km of Highland Park. Our neighborhoods will be defined as circular areas with a radius of 300 meters, so our neighborhood centers will be 600 meters apart.

To accurately calculate distances we need to create our grid of locations in a 2D Cartesian coordinate system, which allows us to calculate distances in meters (not in latitude/longitude degrees). Then we'll project those coordinates back to latitude/longitude degrees to be shown on a Folium map. So let's create functions to convert between the WGS84 spherical coordinate system (latitude/longitude degrees) and the UTM Cartesian coordinate system (X/Y coordinates in meters).
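To see why degrees cannot be used directly as distances, here is a small standard-library sketch that measures the ground length of 0.01 degrees of latitude and 0.01 degrees of longitude at Highland Park using the haversine formula. The helper name `haversine_m` and the spherical Earth radius of 6,371,000 m are assumptions made for illustration; they are not part of the notebook's own code.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, assuming a spherical model

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

lat, lon = 40.6871747, -73.88774612903762  # Highland Park, from the geocoding cell
north_south = haversine_m(lat, lon, lat + 0.01, lon)  # 0.01 degrees of latitude
east_west = haversine_m(lat, lon, lat, lon + 0.01)    # 0.01 degrees of longitude
print(round(north_south), round(east_west))
```

At this latitude the same angular step covers roughly 1.1 km north-south but only about 0.84 km east-west, which is why the grid is built in projected coordinates rather than in degrees.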

In [4]:
from functools import partial
from pyproj import Proj, transform

#!pip install shapely
import shapely.geometry
#!pip install pyproj
import pyproj
import math

# p1: EPSG:2263 (NAD83 / New York Long Island, coordinates in US survey feet)
p1  = pyproj.Proj(init='epsg:2263')
# p2: EPSG:4326 (WGS84 longitude/latitude in degrees)
p2 = pyproj.Proj(init='epsg:4326')

# project the center (lon, lat) to x, y in p1
x1, y1 = p1(ny_center[1],ny_center[0])

# transform this point back from p1 to p2 (longitude/latitude) coordinates
transformer = partial(transform, p1, p2)
x2, y2 = transformer(x1,y1)

print(x1,y1)
print(x2,y2)
print( p2(x2,y2,inverse=True))


1015381.8256263086 189652.65225018133
-73.88774612903762 40.68717470093264
(-73.88774612903762, 40.68717470093264)




In [5]:
#!pip install shapely
import shapely.geometry

#!pip install pyproj
import pyproj

import math


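# NOTE: zone=33 below is the UTM zone for central Europe; New York City lies in UTM zone 18.
# The outputs in this notebook were produced with zone=33, so the X/Y values are not true local meters.
def lonlat_to_xy(lon, lat):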
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

print('Coordinate transformation check')
print('-------------------------------')
print('New York center longitude={}, latitude={}'.format(ny_center[1], ny_center[0]))
x, y = lonlat_to_xy(ny_center[1], ny_center[0])
print('New York center UTM X={}, Y={}'.format(x, y))
lo, la = xy_to_lonlat(x, y)
print('New York center longitude={}, latitude={}'.format(lo, la))

Coordinate transformation check
-------------------------------
New York center longitude=-73.88774612903762, latitude=40.6871747
New York center UTM X=-5826198.208165835, Y=9854210.041127639
New York center longitude=-73.88774612903721, latitude=40.68717469999886




Let's create a **hexagonal grid of cells**: we offset every other row, and adjust the vertical row spacing so that **every cell center is equally distant from all its neighbors**.
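The geometry behind this can be checked in isolation. The sketch below is a simplified, standard-library version of the idea: rows are spaced by `spacing * sqrt(3) / 2` and every other row is shifted by half a step. The function `hex_grid` is a hypothetical helper, and unlike the notebook's cell it places one grid point exactly on the center, so the point count need not match the notebook's.

```python
import math

def hex_grid(cx, cy, radius, spacing):
    """Return (x, y) centers of a hexagonal grid within `radius` of (cx, cy)."""
    row_step = spacing * math.sqrt(3) / 2  # vertical distance between rows
    n_rows = int(radius / row_step)
    n_cols = int(radius / spacing) + 1
    points = []
    for i in range(-n_rows, n_rows + 1):
        y = cy + i * row_step
        x_offset = spacing / 2 if i % 2 else 0.0  # shift every other row by half a step
        for j in range(-n_cols, n_cols + 1):
            x = cx + j * spacing + x_offset
            if math.hypot(x - cx, y - cy) <= radius:
                points.append((x, y))
    return points

points = hex_grid(0.0, 0.0, radius=6000, spacing=600)

# Distance from the central point to its closest neighbor: should equal the spacing
origin = (0.0, 0.0)
nearest = min(math.hypot(x, y) for x, y in points if (x, y) != origin)
print(len(points), round(nearest, 6))
```

With the `sqrt(3) / 2` row spacing, the diagonal neighbors sit at `(spacing / 2, row_step)`, whose distance from the origin works out to exactly `spacing`, the same as the horizontal neighbors.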

In [6]:
ny_center_x, ny_center_y = lonlat_to_xy(ny_center[1], ny_center[0]) # City center in Cartesian coordinates

k = math.sqrt(3) / 2 # Vertical offset for hexagonal grid cells
distance = 6000
distance_center = 6001
x_min = ny_center_x - distance
x_step = distance * 0.1
y_min = ny_center_y - distance - (int(21/k) * k * distance * 0.1 - distance * 2)/2
y_step = distance * 0.1 * k

list_latitudes = []
list_longitudes = []
list_distances_from_center = []
xs = []
ys = []
for i in range(0, int(21/k)):
    y = y_min + i * y_step
    x_offset = (distance * 0.1 / 2) if i%2==0 else 0
    for j in range(0, 21):
        x = x_min + j * x_step + x_offset
        distance_from_center = calc_xy_distance(ny_center_x, ny_center_y, x, y)
        if (distance_from_center <= distance_center):
            lon, lat = xy_to_lonlat(x, y)
            list_latitudes.append(lat)
            list_longitudes.append(lon)
            list_distances_from_center.append(distance_from_center)
            xs.append(x)
            ys.append(y)

print(len(list_latitudes), 'candidate neighborhood centers generated.')



364 candidate neighborhood centers generated.






In [7]:
map_ny = folium.Map(location=ny_center, zoom_start=13)
folium.Marker(ny_center, popup='Highland Park').add_to(map_ny)
for lat, lon in zip(list_latitudes, list_longitudes):
    #folium.CircleMarker([lat, lon], radius=2, color='blue', fill=True, fill_color='blue', fill_opacity=1).add_to(map_ny)
    folium.Circle([lat, lon], radius=300, color='blue', fill=False).add_to(map_ny)
    #folium.Marker([lat, lon]).add_to(map_ny)
map_ny

OK, we now have the coordinates of the centers of the neighborhoods/areas to be evaluated, equally spaced (the distance from every point to its neighbors is exactly the same) and within ~6km of Highland Park.

Let's now collect these locations in a DataFrame, then use the Foursquare API to look up the venues (and their approximate addresses) around each of them.

In [8]:
print(ny_center[0])
print(ny_center[1])

# Append the center point itself (run this cell only once)
list_latitudes.append(ny_center[0])
list_longitudes.append(ny_center[1])
list_distances_from_center.append(0)
xs.append(ny_center_x) # use the center's own X/Y, not the last grid point left over from the loop
ys.append(ny_center_y)

df = { 'lat': list_latitudes, 'lng': list_longitudes, 'dist_center': list_distances_from_center , 'X' : xs, 'Y' : ys}
df_location_centers = pd.DataFrame( df )
print( df_location_centers.shape )
df_location_centers.head()

40.6871747
-73.88774612903762
(365, 5)


Unnamed: 0,lat,lng,dist_center,X,Y
0,40.675562,-73.843971,5992.495307,-5827998.0,9848494.0
1,40.679094,-73.843828,5840.3767,-5827398.0,9848494.0
2,40.682626,-73.843685,5747.173218,-5826798.0,9848494.0
3,40.686158,-73.843542,5715.767665,-5826198.0,9848494.0
4,40.68969,-73.843399,5747.173218,-5825598.0,9848494.0


In [9]:
df_location_centers.tail()

Unnamed: 0,lat,lng,dist_center,X,Y
360,40.688152,-73.931952,5715.767665,-5826198.0,9859926.0
361,40.691684,-73.93182,5747.173218,-5825598.0,9859926.0
362,40.695217,-73.931688,5840.3767,-5824998.0,9859926.0
363,40.69875,-73.931556,5992.495307,-5824398.0,9859926.0
364,40.687175,-73.887746,0.0,-5826198.0,9854210.0


### Check the businesses near the center

In [10]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
ACCESS_TOKEN = '' # your FourSquare Access Token
VERSION = '20180605'
LIMIT = 100
radius = 600

In [11]:
latitude_ny = df_location_centers['lat'][len(df_location_centers)-1]
longitude_ny = df_location_centers['lng'][len(df_location_centers)-1]
print(latitude_ny)
print(longitude_ny)

40.6871747
-73.88774612903762


In [None]:
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    latitude_ny, 
    longitude_ny, 
    radius, 
    LIMIT)
url

In [None]:
print(url)
results = requests.get(url).json()
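As an alternative to formatting the query string by hand, the request URL can be assembled with `urllib.parse.urlencode`, which takes care of escaping the parameter values. This is only a sketch: `build_explore_url` is a hypothetical helper, and `"ID"` / `"SECRET"` are placeholders rather than real credentials. The endpoint and parameter names are the ones used in the cell above.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Assemble a 'venues/explore' URL with URL-encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),  # latitude,longitude pair
        "radius": radius,                # search radius in meters
        "limit": limit,                  # maximum number of venues returned
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

# Placeholder credentials; the coordinates are those of Highland Park
example_url = build_explore_url("ID", "SECRET", "20180605",
                                40.6871747, -73.88774612903762, 600, 100)
print(example_url)
```

Note that `urlencode` escapes the comma in `ll` as `%2C`, which is the standard encoding for that character in a query string.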

In [14]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
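To make the behavior of this helper concrete, the sketch below repeats the function and applies it to two hand-made records. The records are invented for illustration and use plain dictionaries instead of DataFrame rows; only the key names `categories` and `venue.categories` come from the notebook.

```python
def get_category_type(row):
    """Return the name of the first category of a venue record, or None if it has none."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Invented sample records, not real API output
with_category = {'venue.categories': [{'name': 'Gym'}, {'name': 'Yoga Studio'}]}
without_category = {'venue.categories': []}

print(get_category_type(with_category))     # first category name
print(get_category_type(without_category))  # no categories at all
```

Only the first category is kept, so a venue tagged with several categories is counted under its first one.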

In [15]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng', 'venue.location.distance', 'venue.location.formattedAddress' ]
nearby_venues = nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

# add column of location reference
nearby_venues['ref_lat'] = [latitude_ny]*len(nearby_venues)
nearby_venues['ref_lng'] = [longitude_ny]*len(nearby_venues)


print(nearby_venues.shape)
nearby_venues.head(20)

(7, 8)


Unnamed: 0,name,categories,lat,lng,distance,formattedAddress,ref_lat,ref_lng
0,Highland Park,Park,40.682002,-73.889173,588,"[Brooklyn, NY 11208, United States]",40.687175,-73.887746
1,North Brooklyn YMCA,Gym / Fitness Center,40.682066,-73.889165,581,"[570 Jamaica Ave (at Shepherd Ave.), Brooklyn,...",40.687175,-73.887746
2,Yogo Delight,Ice Cream Shop,40.689076,-73.891238,362,"[Brooklyn, NY 11201, United States]",40.687175,-73.887746
3,Jackie Robinson Parkway at Exit 2,Intersection,40.690494,-73.889445,396,"[Jackie Robinson Pkwy (at Cypress Ave), Brookl...",40.687175,-73.887746
4,Highland Park Children's Garden,Garden,40.683882,-73.885818,401,"[Jamaica Ave (Warwick & Force Tube), Brooklyn,...",40.687175,-73.887746
5,Z.MIAH & SON CONTRACTING CO,Construction & Landscaping,40.683488,-73.889487,435,"[23 Sunnyside Ct Fl 1, Brooklyn, NY 11207, Uni...",40.687175,-73.887746
6,Mandate of Heaven,Boutique,40.68411,-73.883551,491,"[17 Essex St (Between Hester and Canal), Brook...",40.687175,-73.887746


In [16]:
nearby_venues['distance'].describe()

count      7.000000
mean     464.857143
std       90.945824
min      362.000000
25%      398.500000
50%      435.000000
75%      536.000000
max      588.000000
Name: distance, dtype: float64

### Query the venues around each center point shown on the Folium map

In [17]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605'
LIMIT = 100
radius = 600

In [25]:
df_location_centers.head()

Unnamed: 0,lat,lng,dist_center,X,Y
0,40.675562,-73.843971,5992.495307,-5827998.0,9848494.0
1,40.679094,-73.843828,5840.3767,-5827398.0,9848494.0
2,40.682626,-73.843685,5747.173218,-5826798.0,9848494.0
3,40.686158,-73.843542,5715.767665,-5826198.0,9848494.0
4,40.68969,-73.843399,5747.173218,-5825598.0,9848494.0


In [29]:
frames = [] # one DataFrame of venues per grid center

for index, row in df_location_centers.iterrows():
    
    print( "{} .".format(index), end='' )

    url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(    CLIENT_ID,     CLIENT_SECRET,     VERSION,  row['lat'],  row['lng'], radius, LIMIT)

    results = requests.get(url).json()

    venues = results['response']['groups'][0]['items']

    nearby_venues = json_normalize(venues) # flatten JSON

    # filter columns
    filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng', 'venue.location.distance', 'venue.location.formattedAddress' ]
    nearby_venues = nearby_venues.loc[:, filtered_columns]

    # extract the category name for each row
    nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

    # clean columns
    nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

    # add columns with the reference (grid center) location
    nearby_venues['ref_lat'] = row['lat']
    nearby_venues['ref_lng'] = row['lng']

    frames.append(nearby_venues)

# DataFrame.append was removed in pandas 2.0; concatenate all frames once instead
df_nearby_venues = pd.concat(frames, ignore_index=True)


0 .1 .2 .3 .4 .5 .6 .7 .8 .9 .10 .11 .12 .13 .14 .15 .16 .17 .18 .19 .20 .21 .22 .23 .24 .25 .26 .27 .28 .29 .30 .31 .32 .33 .34 .35 .36 .37 .38 .39 .40 .41 .42 .43 .44 .45 .46 .47 .48 .49 .50 .51 .52 .53 .54 .55 .56 .57 .58 .59 .60 .61 .62 .63 .64 .65 .66 .67 .68 .69 .70 .71 .72 .73 .74 .75 .76 .77 .78 .79 .80 .81 .82 .83 .84 .85 .86 .87 .88 .89 .90 .91 .92 .93 .94 .95 .96 .97 .98 .99 .100 .101 .102 .103 .104 .105 .106 .107 .108 .109 .110 .111 .112 .113 .114 .115 .116 .117 .118 .119 .120 .121 .122 .123 .124 .125 .126 .127 .128 .129 .130 .131 .132 .133 .134 .135 .136 .137 .138 .139 .140 .141 .142 .143 .144 .145 .146 .147 .148 .149 .150 .151 .152 .153 .154 .155 .156 .157 .158 .159 .160 .161 .162 .163 .164 .165 .166 .167 .168 .169 .170 .171 .172 .173 .174 .175 .176 .177 .178 .179 .180 .181 .182 .183 .184 .185 .186 .187 .188 .189 .190 .191 .192 .193 .194 .195 .196 .197 .198 .199 .200 .201 .202 .203 .204 .205 .206 .207 .208 .209 .210 .211 .212 .213 .214 .215 .216 .217 .218 .219 .220 .221 .

In [30]:
df_nearby_venues.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11457 entries, 0 to 11456
Data columns (total 8 columns):
name                11457 non-null object
categories          11457 non-null object
lat                 11457 non-null float64
lng                 11457 non-null float64
distance            11457 non-null int64
formattedAddress    11457 non-null object
ref_lat             11457 non-null float64
ref_lng             11457 non-null float64
dtypes: float64(4), int64(1), object(3)
memory usage: 716.1+ KB


In [31]:
df_nearby_venues['distance'].describe()

count    11457.000000
mean       412.078555
std        138.063219
min          5.000000
25%        319.000000
50%        439.000000
75%        528.000000
max        601.000000
Name: distance, dtype: float64

In [32]:
df_nearby_venues['categories'].value_counts().head(8)

Pizza Place             699
Deli / Bodega           674
Bar                     384
Chinese Restaurant      352
Discount Store          294
Grocery Store           282
Coffee Shop             277
Fast Food Restaurant    257
Name: categories, dtype: int64

In [33]:
df_nearby_venues.head()

Unnamed: 0,name,categories,lat,lng,distance,formattedAddress,ref_lat,ref_lng
0,Beer Town,Beer Store,40.67278,-73.8438,310,"[135-26 Crossbay Blvd (at Desarc Rd), Ozone Pa...",40.675562,-73.843971
1,Natural Body Inc.,Health Food Store,40.672935,-73.843796,292,"[135-26 Cross Bay Blvd (Desarc Rd), Ozone Park...",40.675562,-73.843971
2,Zumba® Crossbay Blvd,Gym,40.678767,-73.843678,357,"[10701 Crossbay Blvd (107th Ave), Ozone Park, ...",40.675562,-73.843971
3,CJ's Bar & Lounge,Lounge,40.671836,-73.842968,423,[137-09 Crossbay Blvd (btwn Pitkin & 149th Ave...,40.675562,-73.843971
4,Mia Halal Food,Restaurant,40.680003,-73.844438,495,[105-07 Crossbay Blvd (LIBERTY AVE & 107TH AVE...,40.675562,-73.843971


In [34]:
df_nearby_venues['id_ref'] = df_nearby_venues['ref_lat'].astype(str) + '_' + df_nearby_venues['ref_lng'].astype(str)
df_nearby_venues['id_loc'] = df_nearby_venues['lat'].astype(str) + '_' + df_nearby_venues['lng'].astype(str)
# df_nearby_venues['id_ref'].value_counts()
# df_nearby_venues['qtd_ref_loc'] = df_nearby_venues.groupby(['id_ref']).transform('count')
df_nearby_venues['qtd_spot_loc'] = df_nearby_venues['id_ref'].map(df_nearby_venues['id_ref'].value_counts())
df_nearby_venues['qtd_gym_loc'] = df_nearby_venues['id_ref'].map(df_nearby_venues[df_nearby_venues['categories'].str.contains("Gym")]['id_ref'].value_counts())

df_nearby_venues.head()

Unnamed: 0,name,categories,lat,lng,distance,formattedAddress,ref_lat,ref_lng,id_ref,id_loc,qtd_spot_loc,qtd_gym_loc
0,Beer Town,Beer Store,40.67278,-73.8438,310,"[135-26 Crossbay Blvd (at Desarc Rd), Ozone Pa...",40.675562,-73.843971,40.67556247717076_-73.84397073582448,40.67277981804149_-73.84380039218584,42,2.0
1,Natural Body Inc.,Health Food Store,40.672935,-73.843796,292,"[135-26 Cross Bay Blvd (Desarc Rd), Ozone Park...",40.675562,-73.843971,40.67556247717076_-73.84397073582448,40.672935485839844_-73.84379577636719,42,2.0
2,Zumba® Crossbay Blvd,Gym,40.678767,-73.843678,357,"[10701 Crossbay Blvd (107th Ave), Ozone Park, ...",40.675562,-73.843971,40.67556247717076_-73.84397073582448,40.678767_-73.843678,42,2.0
3,CJ's Bar & Lounge,Lounge,40.671836,-73.842968,423,[137-09 Crossbay Blvd (btwn Pitkin & 149th Ave...,40.675562,-73.843971,40.67556247717076_-73.84397073582448,40.67183636900458_-73.8429683018912,42,2.0
4,Mia Halal Food,Restaurant,40.680003,-73.844438,495,[105-07 Crossbay Blvd (LIBERTY AVE & 107TH AVE...,40.675562,-73.843971,40.67556247717076_-73.84397073582448,40.68000268652495_-73.84443760252918,42,2.0
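The two `qtd_` columns are per-cell counts: the total number of venues returned for a grid center, and the number of those whose category contains "Gym". Because the gym count is mapped from a filtered `value_counts()`, cells with no gyms end up as missing values, which is why `qtd_gym_loc` is a float column. The standard-library sketch below shows the same counting logic on made-up rows; the cell identifiers and categories are hypothetical.

```python
from collections import Counter

# Hypothetical (cell_id, category) pairs standing in for rows of the venue table
rows = [
    ("cell_a", "Gym"),
    ("cell_a", "Pizza Place"),
    ("cell_a", "Climbing Gym"),
    ("cell_b", "Bar"),
    ("cell_b", "Deli / Bodega"),
]

venues_per_cell = Counter(cell for cell, _ in rows)
gyms_per_cell = Counter(cell for cell, category in rows if "Gym" in category)

# A Counter returns 0 for missing keys, so cells without gyms get an explicit zero
summary = {cell: (venues_per_cell[cell], gyms_per_cell[cell]) for cell in venues_per_cell}
print(summary)
```

In the DataFrame version, a similar effect could be obtained by filling the missing gym counts with zero before using the column in comparisons.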


In [35]:
df_gym = df_nearby_venues[df_nearby_venues['categories'].str.contains("Gym")].copy() # copy so later column assignments do not trigger SettingWithCopyWarning

In [36]:
df_gym.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 257 entries, 2 to 11451
Data columns (total 12 columns):
name                257 non-null object
categories          257 non-null object
lat                 257 non-null float64
lng                 257 non-null float64
distance            257 non-null int64
formattedAddress    257 non-null object
ref_lat             257 non-null float64
ref_lng             257 non-null float64
id_ref              257 non-null object
id_loc              257 non-null object
qtd_spot_loc        257 non-null int64
qtd_gym_loc         257 non-null float64
dtypes: float64(5), int64(2), object(5)
memory usage: 26.1+ KB


In [37]:
df_gym.head()

Unnamed: 0,name,categories,lat,lng,distance,formattedAddress,ref_lat,ref_lng,id_ref,id_loc,qtd_spot_loc,qtd_gym_loc
2,Zumba® Crossbay Blvd,Gym,40.678767,-73.843678,357,"[10701 Crossbay Blvd (107th Ave), Ozone Park, ...",40.675562,-73.843971,40.67556247717076_-73.84397073582448,40.678767_-73.843678,42,2.0
19,Monkey Fist Martial Arts,Gym / Fitness Center,40.67288,-73.843907,298,"[135-26 Cross Bay Blvd. (Pitkin Ave.), Ozone P...",40.675562,-73.843971,40.67556247717076_-73.84397073582448,40.67288_-73.843907,42,2.0
43,Zumba® Crossbay Blvd,Gym,40.678767,-73.843678,38,"[10701 Crossbay Blvd (107th Ave), Ozone Park, ...",40.679094,-73.843828,40.67909404965081_-73.84382785732734,40.678767_-73.843678,37,2.0
47,Blink Fitness,Gym,40.681373,-73.837935,558,"[102-16 Liberty Ave, Ozone Park, NY 11417, Uni...",40.679094,-73.843828,40.67909404965081_-73.84382785732734,40.68137255319542_-73.83793505166966,37,2.0
82,Zumba® Crossbay Blvd,Gym,40.678767,-73.843678,429,"[10701 Crossbay Blvd (107th Ave), Ozone Park, ...",40.682626,-73.843685,40.68262586982361_-73.8436849535023,40.678767_-73.843678,42,3.0


In [38]:
df_gym['categories'].value_counts()

Gym                     143
Gym / Fitness Center    104
Climbing Gym              7
Gymnastics Gym            3
Name: categories, dtype: int64

In [39]:
map_ny = folium.Map(location=ny_center, zoom_start=13)
folium.Marker(ny_center, popup='Highland Park').add_to(map_ny)

for index, res in df_gym.iterrows():
    
    lat = res['lat']; lon = res['lng']
    color = 'red' if res['categories'] == 'Gym' else  'blue' if res['categories'] == 'Gym / Fitness Center' else 'black'
    
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_ny)
map_ny

Looking good. So now we have all the venues in the area within a few kilometers of Highland Park, and we know which ones are gyms! We also know exactly which venues are in the vicinity of every neighborhood candidate center.

This concludes the data gathering phase - we're now ready to use this data for analysis to produce the report on optimal locations for a new gym!

# - Methodology <a name="methodology"></a>

In this project we will direct our efforts toward detecting areas of New York that have low venue density, particularly those with a low number of gyms. We will limit our analysis to the area within ~6km of the city center.

In the first step we collected the required **data: location and type (category) of every venue within 6km of the New York center** (Highland Park). We have also **identified gyms** (according to Foursquare categorization).

The second step in our analysis will be the calculation and exploration of '**gym density**' across different areas of New York - we will use **heatmaps** to identify a few promising areas close to the center with a low number of venues in general (*and* no gyms in the vicinity) and focus our attention on those areas.

In the third and final step we will focus on the most promising areas and, within those, create **clusters of locations that meet some basic requirements** established in discussion with stakeholders: we will take into consideration locations with **no more than two gyms within a radius of 250 meters**, and we want locations **without gyms within a radius of 400 meters**. We will present a map of all such locations but also create clusters (using **k-means clustering**) of those locations to identify general zones / neighborhoods / addresses which should be a starting point for final 'street level' exploration and the search for an optimal venue location by stakeholders.
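The two distance thresholds in the final step amount to counting points inside a circle around each candidate location. The sketch below shows that filter with the standard library only. It is an assumption-laden illustration: `count_within` and `is_candidate` are hypothetical helpers, the coordinates are made up and taken to be in meters, and the function accepts two separate point lists so that the 250 m rule can be tried either on all venues or on gyms only.

```python
import math

def count_within(points, center, radius):
    """Count points whose Euclidean distance to `center` is at most `radius`."""
    cx, cy = center
    return sum(1 for x, y in points if math.hypot(x - cx, y - cy) <= radius)

def is_candidate(center, nearby, gyms, max_nearby=2, nearby_radius=250, gym_radius=400):
    """Apply both distance thresholds to one candidate location (coordinates in meters)."""
    few_nearby = count_within(nearby, center, nearby_radius) <= max_nearby
    no_gyms = count_within(gyms, center, gym_radius) == 0
    return few_nearby and no_gyms

# Made-up coordinates in meters, for illustration only
nearby = [(100, 0), (0, 200), (300, 300)]
gyms = [(500, 0)]
print(is_candidate((0, 0), nearby, gyms))    # gym is 500 m away, two points within 250 m
print(is_candidate((200, 0), nearby, gyms))  # gym is 300 m away, inside the 400 m radius
```

The locations that pass this filter would then be the input to the clustering step.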

# - Analysis <a name="analysis"></a>

Let's perform some basic exploratory data analysis and derive some additional info from our raw data. First let's look at the **number of gyms in every candidate area**:

In [40]:
df_gym.head()

Unnamed: 0,name,categories,lat,lng,distance,formattedAddress,ref_lat,ref_lng,id_ref,id_loc,qtd_spot_loc,qtd_gym_loc
2,Zumba® Crossbay Blvd,Gym,40.678767,-73.843678,357,"[10701 Crossbay Blvd (107th Ave), Ozone Park, ...",40.675562,-73.843971,40.67556247717076_-73.84397073582448,40.678767_-73.843678,42,2.0
19,Monkey Fist Martial Arts,Gym / Fitness Center,40.67288,-73.843907,298,"[135-26 Cross Bay Blvd. (Pitkin Ave.), Ozone P...",40.675562,-73.843971,40.67556247717076_-73.84397073582448,40.67288_-73.843907,42,2.0
43,Zumba® Crossbay Blvd,Gym,40.678767,-73.843678,38,"[10701 Crossbay Blvd (107th Ave), Ozone Park, ...",40.679094,-73.843828,40.67909404965081_-73.84382785732734,40.678767_-73.843678,37,2.0
47,Blink Fitness,Gym,40.681373,-73.837935,558,"[102-16 Liberty Ave, Ozone Park, NY 11417, Uni...",40.679094,-73.843828,40.67909404965081_-73.84382785732734,40.68137255319542_-73.83793505166966,37,2.0
82,Zumba® Crossbay Blvd,Gym,40.678767,-73.843678,429,"[10701 Crossbay Blvd (107th Ave), Ozone Park, ...",40.682626,-73.843685,40.68262586982361_-73.8436849535023,40.678767_-73.843678,42,3.0


In [41]:
df_gym['qtd_gym_loc'].describe()

count    257.000000
mean       1.856031
std        0.947186
min        1.000000
25%        1.000000
50%        2.000000
75%        2.000000
max        4.000000
Name: qtd_gym_loc, dtype: float64

In [42]:
df_gym[df_gym['qtd_gym_loc'] == 1].shape

(116, 12)

In [None]:
def lonlat_to_xy(lon, lat):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    xy = pyproj.transform(proj_latlon, proj_xy, lon, lat)
    return xy[0], xy[1]

def xy_to_lonlat(x, y):
    proj_latlon = pyproj.Proj(proj='latlong',datum='WGS84')
    proj_xy = pyproj.Proj(proj="utm", zone=33, datum='WGS84')
    lonlat = pyproj.transform(proj_xy, proj_latlon, x, y)
    return lonlat[0], lonlat[1]

def calc_xy_distance(x1, y1, x2, y2):
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx*dx + dy*dy)

In [45]:
xy_center = lonlat_to_xy(ny_center[1],ny_center[0])
print(xy_center)

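# NOTE: lonlat_to_xy returns (x, y); the column names below are swapped (x_* holds Y, y_* holds X).
# The swap is applied consistently to both points, so the Euclidean distances computed later are unaffected.
df_gym['x_center'] = xy_center[1]
df_gym['y_center'] = xy_center[0]

xy_loc = lonlat_to_xy(df_gym['lng'],df_gym['lat']) # project once instead of twice
df_gym['x_loc'] = xy_loc[1]
df_gym['y_loc'] = xy_loc[0]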

df_gym.head()


(-5826198.208165835, 9854210.041127639)


  


Unnamed: 0,name,categories,lat,lng,distance,formattedAddress,ref_lat,ref_lng,id_ref,id_loc,qtd_spot_loc,qtd_gym_loc,x_center,y_center,x_loc,y_loc
2,Zumba® Crossbay Blvd,Gym,40.678767,-73.843678,357,"[10701 Crossbay Blvd (107th Ave), Ozone Park, ...",40.675562,-73.843971,40.67556247717076_-73.84397073582448,40.678767_-73.843678,42,2.0,9854210.0,-5826198.0,9848473.0,-5827453.0
19,Monkey Fist Martial Arts,Gym / Fitness Center,40.67288,-73.843907,298,"[135-26 Cross Bay Blvd. (Pitkin Ave.), Ozone P...",40.675562,-73.843971,40.67556247717076_-73.84397073582448,40.67288_-73.843907,42,2.0,9854210.0,-5826198.0,9848472.0,-5828453.0
43,Zumba® Crossbay Blvd,Gym,40.678767,-73.843678,38,"[10701 Crossbay Blvd (107th Ave), Ozone Park, ...",40.679094,-73.843828,40.67909404965081_-73.84382785732734,40.678767_-73.843678,37,2.0,9854210.0,-5826198.0,9848473.0,-5827453.0
47,Blink Fitness,Gym,40.681373,-73.837935,558,"[102-16 Liberty Ave, Ozone Park, NY 11417, Uni...",40.679094,-73.843828,40.67909404965081_-73.84382785732734,40.68137255319542_-73.83793505166966,37,2.0,9854210.0,-5826198.0,9847745.0,-5826988.0
82,Zumba® Crossbay Blvd,Gym,40.678767,-73.843678,429,"[10701 Crossbay Blvd (107th Ave), Ozone Park, ...",40.682626,-73.843685,40.68262586982361_-73.8436849535023,40.678767_-73.843678,42,3.0,9854210.0,-5826198.0,9848473.0,-5827453.0


In [48]:
distances_min_to_gym = []
distances_to_gym_center = []

for index_1, row in df_gym.iterrows():
    print(index_1, end=' .')
    min_distance = 36000
    dis_center = float(calc_xy_distance( row['x_loc'],row['y_loc'],row['x_center'],row['y_center'] ))

    for index_2, res in df_gym.iterrows():
        if row['id_loc'] != res['id_loc'] :
            d = calc_xy_distance(row['x_loc'], row['y_loc'], res['x_loc'], res['y_loc'])
            if d < min_distance:
                min_distance = float(d)
    
    distances_min_to_gym.append(min_distance)
    distances_to_gym_center.append(dis_center)
    
df_gym['dist_min_to_gym'] = distances_min_to_gym
df_gym['dist_gym_center'] = distances_to_gym_center

df_gym.head()

2 .19 .43 .47 .82 .85 .86 .122 .163 .239 .270 .291 .299 .327 .367 .500 .523 .534 .555 .571 .775 .804 .848 .852 .871 .911 .918 .940 .956 .1186 .1199 .1211 .1228 .1269 .1281 .1303 .1318 .1324 .1342 .1362 .1586 .1593 .1612 .1630 .1660 .2070 .2085 .2091 .2104 .2141 .2168 .2179 .2591 .2619 .2649 .2662 .2701 .2715 .2742 .3057 .3071 .3092 .3096 .3121 .3146 .3181 .3283 .3318 .3354 .3383 .3484 .3501 .3504 .3557 .3569 .3593 .3646 .3736 .3738 .3763 .3775 .3791 .3887 .3914 .4028 .4060 .4141 .4152 .4169 .4170 .4190 .4193 .4197 .4215 .4222 .4313 .4406 .4455 .4584 .4604 .4627 .4629 .4633 .4644 .4654 .4655 .4666 .4735 .4749 .4779 .4796 .4805 .4819 .4838 .4887 .5075 .5097 .5109 .5113 .5169 .5180 .5215 .5222 .5266 .5279 .5344 .5361 .5443 .5545 .5562 .5630 .5670 .5679 .5732 .5740 .5786 .5816 .5836 .5869 .5890 .5904 .5937 .5970 .6083 .6091 .6116 .6173 .6275 .6355 .6427 .6466 .6505 .6553 .6567 .6585 .6607 .6608 .6637 .6639 .6744 .6768 .6816 .6829 .6839 .6842 .6882 .6900 .6905 .6968 .6970 .6972 .7069 .7111 



Unnamed: 0,name,categories,lat,lng,distance,formattedAddress,ref_lat,ref_lng,id_ref,id_loc,qtd_spot_loc,qtd_gym_loc,x_center,y_center,x_loc,y_loc,dist_min_to_gym,dist_gym_center
2,Zumba® Crossbay Blvd,Gym,40.678767,-73.843678,357,"[10701 Crossbay Blvd (107th Ave), Ozone Park, ...",40.675562,-73.843971,40.67556247717076_-73.84397073582448,40.678767_-73.843678,42,2.0,9854210.0,-5826198.0,9848473.0,-5827453.0,864.248119,5872.491944
19,Monkey Fist Martial Arts,Gym / Fitness Center,40.67288,-73.843907,298,"[135-26 Cross Bay Blvd. (Pitkin Ave.), Ozone P...",40.675562,-73.843971,40.67556247717076_-73.84397073582448,40.67288_-73.843907,42,2.0,9854210.0,-5826198.0,9848472.0,-5828453.0,1000.171734,6165.257337
43,Zumba® Crossbay Blvd,Gym,40.678767,-73.843678,38,"[10701 Crossbay Blvd (107th Ave), Ozone Park, ...",40.679094,-73.843828,40.67909404965081_-73.84382785732734,40.678767_-73.843678,37,2.0,9854210.0,-5826198.0,9848473.0,-5827453.0,864.248119,5872.491944
47,Blink Fitness,Gym,40.681373,-73.837935,558,"[102-16 Liberty Ave, Ozone Park, NY 11417, Uni...",40.679094,-73.843828,40.67909404965081_-73.84382785732734,40.68137255319542_-73.83793505166966,37,2.0,9854210.0,-5826198.0,9847745.0,-5826988.0,739.770346,6513.300277
82,Zumba® Crossbay Blvd,Gym,40.678767,-73.843678,429,"[10701 Crossbay Blvd (107th Ave), Ozone Park, ...",40.682626,-73.843685,40.68262586982361_-73.8436849535023,40.678767_-73.843678,42,3.0,9854210.0,-5826198.0,9848473.0,-5827453.0,864.248119,5872.491944


In [49]:
df_gym['dist_min_to_gym'].describe()

count     257.000000
mean      815.891686
std       552.171163
min        15.327330
25%       289.228944
50%       739.770346
75%      1190.054841
max      2017.815522
Name: dist_min_to_gym, dtype: float64
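The cell that computed `dist_min_to_gym` is not shown here. As a rough sketch of how a nearest-neighbour distance of this kind can be obtained, assume projected coordinates in meters (like `x_loc`, `y_loc`) and one row per gym; the helper name `min_dist_to_other` is hypothetical.

```python
import numpy as np

def min_dist_to_other(xy):
    """Distance from each point to its nearest *other* point.

    xy: array-like of shape (n, 2) with projected coordinates in meters.
    Assumes each gym appears only once; duplicated rows would yield 0.
    """
    xy = np.asarray(xy, dtype=float)
    # Pairwise differences, shape (n, n, 2), then Euclidean distances (n, n).
    diff = xy[:, None, :] - xy[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Ignore the distance from a point to itself.
    np.fill_diagonal(dist, np.inf)
    return dist.min(axis=1)

nearest = min_dist_to_other([(0, 0), (3, 4), (10, 0)])
```

The pairwise matrix grows quadratically with the number of points, which is fine for a few hundred gyms but would need a spatial index for much larger sets.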

In [50]:
df_gym['dist_gym_center'].describe()

count     257.000000
mean     4123.913645
std      1478.254158
min       886.658367
25%      3095.805182
50%      4244.660503
75%      5525.262196
max      6513.300277
Name: dist_gym_center, dtype: float64

In [52]:
df_gym.describe()

Unnamed: 0,lat,lng,distance,ref_lat,ref_lng,qtd_spot_loc,qtd_gym_loc,x_center,y_center,x_loc,y_loc,dist_min_to_gym,dist_gym_center
count,257.0,257.0,257.0,257.0,257.0,257.0,257.0,257.0,257.0,257.0,257.0,257.0,257.0
mean,40.689299,-73.890662,406.466926,40.68952,-73.890627,41.097276,1.856031,9854210.0,-5826198.0,9854598.0,-5825848.0,815.891686,4123.913645
std,0.017593,0.024545,141.284813,0.017501,0.024113,20.113794,0.947186,3.172675e-08,2.892733e-08,3165.546,2994.001,552.171163,1478.254158
min,40.65383,-73.933705,38.0,40.653447,-73.932348,6.0,1.0,9854210.0,-5826198.0,9847745.0,-5831950.0,15.32733,886.658367
25%,40.672445,-73.907536,304.0,40.674078,-73.908919,26.0,1.0,9854210.0,-5826198.0,9852243.0,-5828580.0,289.228944,3095.805182
50%,40.693504,-73.894016,441.0,40.695217,-73.894896,37.0,2.0,9854210.0,-5826198.0,9855106.0,-5824965.0,739.770346,4244.660503
75%,40.705351,-73.871676,529.0,40.704749,-73.870904,51.0,2.0,9854210.0,-5826198.0,9856831.0,-5823122.0,1190.054841,5525.262196
max,40.718632,-73.837935,599.0,40.718518,-73.843113,100.0,4.0,9854210.0,-5826198.0,9860203.0,-5820768.0,2017.815522,6513.300277


In [62]:
df_gym[ 
    (df_gym['qtd_spot_loc']<7) & 
    (df_gym['qtd_gym_loc']<2) & 
    (df_gym['dist_min_to_gym']>1190) & 
    (df_gym['dist_gym_center']<3095) 
].shape

(1, 18)
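The distance thresholds in this filter (1190 and 3095) match the 75% and 25% quartiles reported by `describe()` above. A hedged sketch of deriving them from the data instead of typing them in; the function `select_candidates` and its parameter names are assumptions, and `>=` / `<=` are used so that a row sitting exactly on a quartile still qualifies, as the selected row does with the truncated value 1190.

```python
import pandas as pd

def select_candidates(df, max_spots=7, max_gyms=2):
    """Keep rows with few venues, few gyms, far from other gyms, near the center."""
    far_from_gyms = df["dist_min_to_gym"].quantile(0.75)  # upper quartile
    near_center = df["dist_gym_center"].quantile(0.25)    # lower quartile
    mask = (
        (df["qtd_spot_loc"] < max_spots)
        & (df["qtd_gym_loc"] < max_gyms)
        & (df["dist_min_to_gym"] >= far_from_gyms)
        & (df["dist_gym_center"] <= near_center)
    )
    return df[mask]

# Toy data for illustration only; these are not values from the notebook.
toy = pd.DataFrame({
    "qtd_spot_loc": [3, 3, 3, 3, 3],
    "qtd_gym_loc": [1, 1, 1, 1, 1],
    "dist_min_to_gym": [100, 200, 300, 400, 500],
    "dist_gym_center": [1000, 5000, 4000, 2000, 3000],
})
selected = select_candidates(toy)
```

Computing the mask once and reusing it also avoids repeating the same four conditions in several cells.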

In [65]:
gym_select = df_gym[ 
    (df_gym['qtd_spot_loc']<7) & 
    (df_gym['qtd_gym_loc']<2) & 
    (df_gym['dist_min_to_gym']>1190) & 
    (df_gym['dist_gym_center']<3095) 
]

gym_select

Unnamed: 0,name,categories,lat,lng,distance,formattedAddress,ref_lat,ref_lng,id_ref,id_loc,qtd_spot_loc,qtd_gym_loc,x_center,y_center,x_loc,y_loc,dist_min_to_gym,dist_gym_center
6083,The Muse,Gym / Fitness Center,40.691459,-73.902458,499,"[350 Moffat St, Brooklyn, NY 11237, United Sta...",40.687445,-73.899802,40.68744512308892_-73.8998021815457,40.691459213155895_-73.90245841202055,6,1.0,9854210.0,-5826198.0,9856132.0,-5825527.0,1190.054841,2035.737552


In [80]:
gym_select[['lat','lng']].values # [['lat'],['lng']]

array([[ 40.69145921, -73.90245841]])

In [85]:
gym_select['id_loc'].values

array(['40.691459213155895_-73.90245841202055'], dtype=object)

In [86]:
df_gym.shape

(257, 18)

In [89]:
df_gym_2 = df_gym[ ( df_gym['id_loc'] != '40.691459213155895_-73.90245841202055' ) ]
df_gym_2.shape

(248, 18)

In [92]:
map_ny = folium.Map(location=ny_center, zoom_start=13)
folium.Marker(ny_center, popup='Highland Park').add_to(map_ny)
folium.Marker([ 40.69145921, -73.90245841], popup='Gym view').add_to(map_ny)

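# Plot the remaining gyms, colored by their Foursquare category
for index, res in df_gym_2.iterrows():
    lat = res['lat']
    lon = res['lng']
    # Label-based access instead of the positional res[1]
    category = res['categories']
    color = 'red' if category == 'Gym' else 'blue' if category == 'Gym / Fitness Center' else 'black'
    folium.CircleMarker([lat, lon], radius=3, color=color, fill=True, fill_color=color, fill_opacity=1).add_to(map_ny)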
map_ny

# - Results and Discussion <a name="results"></a>

Our analysis shows that although there is a large number of gyms around the New York city center (~2000 in our initial area of interest, which was 12x12 km around Highland Park), there are pockets of low gym density fairly close to the city center. The highest concentration of gyms was detected north and south of Highland Park, so we focused our attention on the areas east and west, corresponding to the neighborhoods of Ocean Hill and Kew Gardens. Another area was identified as potentially interesting (Richmond Hill and Woodhaven, west of Highland Park), but our attention was focused on Ocean Hill and Kew Gardens, which offer a combination of popularity among tourists, closeness to the city center, strong socio-economic dynamics *and* a number of pockets of low gym density.

After narrowing our attention to this smaller area of interest (covering approx. 5x5 km south-east of Highland Park), we first created a dense grid of location candidates (spaced 100 m apart). Those locations were then filtered so that those with more than two businesses within a radius of 250 m and those with a gym closer than 400 m were removed. The remaining location candidates were then clustered to create zones of interest containing the greatest number of location candidates. Addresses of the centers of those zones were also generated using reverse geocoding, to be used as markers/starting points for more detailed local analysis based on other factors.
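The filtering step described above can be sketched as follows. This is a minimal illustration that assumes all coordinates are already projected to meters; `filter_candidates` and its parameter names are hypothetical, and the clustering and reverse-geocoding steps are not covered.

```python
import numpy as np

def filter_candidates(candidates, venues, gyms,
                      venue_radius=250.0, max_venues=2, gym_radius=400.0):
    """Drop candidates with too many nearby venues or with a gym too close."""
    candidates = np.asarray(candidates, dtype=float).reshape(-1, 2)
    venues = np.asarray(venues, dtype=float).reshape(-1, 2)
    gyms = np.asarray(gyms, dtype=float).reshape(-1, 2)
    # Distance from every candidate to every venue / gym.
    d_venues = np.linalg.norm(candidates[:, None, :] - venues[None, :, :], axis=-1)
    d_gyms = np.linalg.norm(candidates[:, None, :] - gyms[None, :, :], axis=-1)
    # Keep candidates with at most `max_venues` venues inside `venue_radius` ...
    few_venues = (d_venues <= venue_radius).sum(axis=1) <= max_venues
    # ... and with no gym closer than `gym_radius`.
    no_gym_near = (d_gyms >= gym_radius).all(axis=1)
    return candidates[few_venues & no_gym_near]

# Toy example: three candidates, four venues, one gym (illustrative values only).
kept = filter_candidates(
    candidates=[(0, 0), (1000, 0), (2000, 0)],
    venues=[(10, 0), (20, 0), (30, 0), (1000, 10)],
    gyms=[(2100, 0)],
)
```

In the toy example the first candidate is dropped because three venues lie within 250 m, and the third because a gym is only 100 m away.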


# - Conclusion <a name="conclusion"></a>

The purpose of this project was to identify New York City areas close to the center with a low number of businesses (particularly gyms), in order to aid stakeholders in narrowing down the search for an optimal location for a new gym. By calculating the gym density distribution from Foursquare data, we first identified the general neighborhoods that justify further analysis (**Ocean Hill and Kew Gardens**), and then generated an extensive collection of locations which satisfy some basic requirements regarding existing nearby businesses. Clustering of those locations was then performed in order to create major zones of interest (containing the greatest number of potential locations), and addresses of those zone centers were created to be used as starting points for final exploration by stakeholders.

The final decision on the optimal gym location will be made by stakeholders based on the specific characteristics of neighborhoods and locations in every recommended zone, taking into consideration additional factors such as the attractiveness of each location (proximity to a park or water), noise levels / proximity to major roads, real estate availability, prices, and the social and economic dynamics of every neighborhood.