# Battle of Neighborhoods: San Jose vs New York

## Anish S

### Introduction

In this capstone project, the aim is to locate the largest cluster of restaurants and coffee shops in San Jose and in New York, in order to decide which city is the better base for a startup that supplies green cutlery. The center of the largest cluster in each city is taken as the best location there, which lets us pick a site for the startup and also compare the two cities.


### Data

Based on the problem, the required data includes:
* the location (coordinates) of the two cities
* the food-related venues (restaurants, coffee shops, and similar) near the center of each city

To acquire the data, we need two sources:
* geopy to get the coordinates of the two cities
* the Foursquare API to get the venues around those coordinates

#### First we get the locations of the two cities

In [232]:
# We can use this to get the coordinates of the cities if we give the address of the cities
from geopy.geocoders import Nominatim

In [233]:
# To actually get the coordinates, we simply define the addresses of the two cities:
add_San = 'San Jose, CA'
add_York = 'New York City, NY'

# Then define an instance of the geolocator
geolocator = Nominatim(user_agent="explorer")

# Then get the location for each city and assign it to a list
location = geolocator.geocode(add_San)
coords_San = [location.latitude, location.longitude]
location = geolocator.geocode(add_York)
coords_York = [location.latitude, location.longitude]

# Show our results:
print('San Jose\'s coordinates are:', coords_San, ' and New York\'s coordinates are:', coords_York)

San Jose's coordinates are: [37.3361905, -121.8905833]  and New York's coordinates are: [40.7127281, -74.0060152]


In [109]:
# We can show the maps of the two cities as follows:
import folium

In [294]:
# Use the above imported library to define a map for both cities:
map_San = folium.Map(location=coords_San, zoom_start=12)
map_York = folium.Map(location=coords_York, zoom_start = 12)

In [295]:
# Then display them:
map_San

In [296]:
map_York

Now that we've located the two cities, and even have maps of the two, it is time to locate all the venues.

In [297]:
# First importing libraries
import requests # library for working with foursquare requests
import json # library for working with json files (which are returned from a foursquare request)
from pandas import json_normalize # used for making the json file into a dataframe (top-level import, pandas >= 1.0)
import pandas # library for working with dataframes

In [298]:
# all the parameters required to make the foursquare url

CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID (credential redacted)
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret (credential redacted)
VERSION = '20180605' # Foursquare API version
LIMIT = 1000
radius = 1000

url_San = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, CLIENT_SECRET, VERSION, coords_San[0], coords_San[1], radius, LIMIT)

url_York = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, CLIENT_SECRET, VERSION, coords_York[0], coords_York[1], radius, LIMIT)

In [299]:
# getting the returned json files from foursquare
json_San = requests.get(url_San).json()
json_York = requests.get(url_York).json()
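The two URLs above are assembled by hand with `str.format`. As a minimal sketch of the same step, the standard library's `urllib.parse.urlencode` can build the query string and take care of escaping. The helper name `build_explore_url` and the placeholder credentials are hypothetical; the endpoint and parameter names are copied from the cell above.

```python
from urllib.parse import urlencode

# Same endpoint as in the cell above
EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, version, coords, radius, limit):
    """Hypothetical helper: build the explore URL for a [lat, lng] pair."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(coords[0], coords[1]),
        'radius': radius,
        'limit': limit,
    }
    # urlencode escapes every value, so the comma in 'll' becomes %2C
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


# Placeholder credentials, for illustration only
example_url = build_explore_url('MY_ID', 'MY_SECRET', '20180605',
                                [37.3361905, -121.8905833], 1000, 1000)
print(example_url)
```

With a helper like this, `url_San` and `url_York` would differ only in the `coords` argument, which avoids repeating the long format string for each city.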

In [300]:
# use this function to get the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [301]:
# getting just the venues, then cleaning it up
venues_San = json_San['response']['groups'][0]['items']
venues_York = json_York['response']['groups'][0]['items']
    
venues_San = json_normalize(venues_San) # flatten JSON
venues_York = json_normalize(venues_York)

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
venues_San = venues_San.loc[:, filtered_columns]
venues_York = venues_York.loc[:, filtered_columns]

# filter the category for each row
venues_San['venue.categories'] = venues_San.apply(get_category_type, axis=1)
venues_York['venue.categories'] = venues_York.apply(get_category_type, axis=1)

# clean columns
venues_San.columns = [col.split(".")[-1] for col in venues_San.columns]
venues_York.columns = [col.split(".")[-1] for col in venues_York.columns]

venues_San.head()


| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Ike's Sandwiches | Sandwich Place | 37.336852 | -121.889436 |
| 1 | San Pedro Square | Plaza | 37.335179 | -121.893044 |
| 2 | Back A Yard Caribbean American Grill | Caribbean Restaurant | 37.336683 | -121.892749 |
| 3 | Original Gravity Public House | Pub | 37.335052 | -121.889747 |
| 4 | ISO: Beers | Beer Bar | 37.336945 | -121.889161 |


In [302]:
# now that we have the nearby venues, we can isolate the food-related ones for each dataframe

# first we find out which categories there are:
categories_San = list(set(venues_San['categories']))
# the indices below are the non-food categories, found by inspecting categories_San, so get rid of them
# NOTE: the order of list(set(...)) can differ between Python sessions, so re-check these indices before rerunning
bad_San = [1, 3, 6, 16, 18, 23, 24, 27, 28, 31, 34, 40, 42, 46, 47, 48]
for value in bad_San:
    venues_San = venues_San[venues_San['categories'] != categories_San[value]]
venues_San.head()

| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | Ike's Sandwiches | Sandwich Place | 37.336852 | -121.889436 |
| 1 | San Pedro Square | Plaza | 37.335179 | -121.893044 |
| 2 | Back A Yard Caribbean American Grill | Caribbean Restaurant | 37.336683 | -121.892749 |
| 3 | Original Gravity Public House | Pub | 37.335052 | -121.889747 |
| 4 | ISO: Beers | Beer Bar | 37.336945 | -121.889161 |


In [303]:
# do the same for new york
categories_York = list(set(venues_York['categories']))
# the indices below are the non-food categories, found by inspecting categories_York
# NOTE: the order of list(set(...)) can differ between Python sessions, so re-check these indices before rerunning
bad_York = [0, 1, 3, 7, 11, 12, 13, 17, 18, 22, 23, 25, 26, 30, 32, 33, 34, 38, 40, 45, 46, 48, 50, 52, 53, 54, 55, 57, 58, 60, 61, 62]
for value in bad_York:
    venues_York = venues_York[venues_York['categories'] != categories_York[value]]
venues_York.head()

| | name | categories | lat | lng |
|---|---|---|---|---|
| 0 | The Bar Room at Temple Court | Hotel Bar | 40.711448 | -74.006802 |
| 4 | The Wooly Daily | Coffee Shop | 40.712137 | -74.008395 |
| 7 | Takahachi Bakery | Bakery | 40.713653 | -74.008804 |
| 10 | Los Tacos No. 1 | Taco Place | 40.714267 | -74.008756 |
| 11 | Pisillo Italian Panini | Sandwich Place | 40.71053 | -74.007526 |
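The two filtering cells above remove categories by their position in `list(set(...))`. Because Python randomizes string hashing by default, that order can change from one session to the next, so hard-coded positions may point at different categories on a rerun. A minimal sketch of a name-based alternative follows. The keyword list and the helper `is_food_category` are hypothetical and would need tuning against the real category names; this is not the filter used to produce the results in this notebook.

```python
# Hypothetical keyword list; extend it after inspecting the actual category names
FOOD_KEYWORDS = {'restaurant', 'coffee', 'cafe', 'café', 'bakery', 'sandwich',
                 'taco', 'pizza', 'bar', 'pub', 'diner', 'deli'}


def is_food_category(category):
    """Return True if any whole word of the category name is a food keyword."""
    if category is None:
        return False
    words = set(category.lower().split())
    # Whole-word matching, so 'Barbershop' does not match 'bar'
    return bool(words & FOOD_KEYWORDS)


# Category names taken from the output tables above
sample = ['Sandwich Place', 'Plaza', 'Caribbean Restaurant', 'Pub', 'Beer Bar',
          'Hotel Bar', 'Coffee Shop', 'Bakery', 'Taco Place']
food_only = [c for c in sample if is_food_category(c)]
print(food_only)
```

Applied to the dataframes, the same predicate could be used as `venues_San[venues_San['categories'].apply(is_food_category)]`, which gives the same result on every run regardless of set ordering.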


In [304]:
# now that we have all the food venues, let us put them on maps:

# assign the maps
map_San = folium.Map(location=coords_San, zoom_start = 15)
map_York = folium.Map(location=coords_York, zoom_start = 15)

# add the store labels
for lat, lng, venue in zip(venues_San['lat'], venues_San['lng'], venues_San['name']):
    label = folium.Popup(venue, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_San)
for lat, lng, venue in zip(venues_York['lat'], venues_York['lng'], venues_York['name']):
    label = folium.Popup(venue, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=3,
        popup=label,
        color='red',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_York)

In [305]:
map_San

In [306]:
map_York

As can be seen, we have now acquired the data for the location of the venues. Next, the analysis.

### Methodology

The next step is to find the best location for the startup. Given the nature of the startup, there is one main criterion.

Because the startup supplies tableware, it should sit in the middle of the area with the most shops, to minimize shipping and delivery costs.

To do this, the venues are first clustered, to separate the shops into different areas. Next, the clusters are compared to see how many shops each one contains. Finally, the center of the cluster with the most venues is taken as the best location for the startup, because from there it can deliver to the most shops at the lowest cost.

The clustering itself will be done with k-means clustering, and then the final centers will give the coordinates for the startup.
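To make the method concrete, here is a minimal pure-Python sketch of the k-means idea (Lloyd's algorithm) on made-up 2D points. It is an illustration only: the function name `kmeans_2d`, the toy points, and the starting centers are all assumptions. The notebook itself uses scikit-learn's `KMeans`, which additionally handles initialization and repeated restarts (`n_init`). Treating latitude and longitude as plane coordinates, as done throughout this notebook, is an approximation that is reasonable over an area of about a kilometer.

```python
import math
from collections import Counter


def kmeans_2d(points, centers, iterations=10):
    """Plain Lloyd's algorithm on (x, y) tuples, starting from the given centers."""
    centers = list(centers)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center
        labels = [
            min(range(len(centers)),
                key=lambda k: math.hypot(p[0] - centers[k][0], p[1] - centers[k][1]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old center if a cluster ends up empty
                centers[k] = (sum(m[0] for m in members) / len(members),
                              sum(m[1] for m in members) / len(members))
    return labels, centers


# Toy data: three points near the origin and four points near (10, 10)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (10.5, 10.5)]
labels, centers = kmeans_2d(points, centers=[(0, 0), (10, 10)])

# The cluster with the most members is the candidate area for the startup
largest = Counter(labels).most_common(1)[0][0]
print(labels, centers[largest])
```

The final center of the most populated cluster plays the same role here as `cluster_centers_` does in the scikit-learn cells that follow.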

In [307]:
# matplotlib and numpy to use for generation of the colors for the venues on the map in differing clusters
import matplotlib.cm as cm
import matplotlib.colors as colors
import numpy as np
# library below used for the k-means clustering
from sklearn.cluster import KMeans

In [308]:
# performing the k-means algorithm with three clusters
cluster_count = 3

clus_San = KMeans(n_clusters=cluster_count, n_init=15, random_state=0).fit(venues_San[['lat', 'lng']])
clus_York = KMeans(n_clusters=cluster_count, n_init=15, random_state=0).fit(venues_York[['lat', 'lng']])

In [309]:
# then putting which cluster each venue is in into the dataframes
venues_San.insert(2, 'labels', clus_San.labels_)
venues_York.insert(2, 'labels', clus_York.labels_)

In [310]:
# then we can map out both maps and show which cluster each venue is in:

# first, build the maps
map_San = folium.Map(location=coords_San, zoom_start = 15)
map_York = folium.Map(location=coords_York, zoom_start = 15)

# then the clusters: first the color scheme for the groups:
x = np.arange(cluster_count)
ys = [i + x + (i*x)**2 for i in range(cluster_count)]
colors_array = cm.rainbow(np.linspace(0, 0.8, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# then add the venues to the maps:
for lat, lon, cluster, name in zip(venues_San['lat'], venues_San['lng'], venues_San['labels'], venues_San['name']):
    label = folium.Popup(str(name) + ': Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=3,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_San)
for lat, lon, cluster, name in zip(venues_York['lat'], venues_York['lng'], venues_York['labels'], venues_York['name']):
    label = folium.Popup(str(name) + ': Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=3,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_York)

# finally, plotting the cluster centers:
# each center is labeled with its own cluster index
for kmeans, city_map in [(clus_San, map_San), (clus_York, map_York)]:
    for i, center in enumerate(kmeans.cluster_centers_):
        folium.CircleMarker(list(center), radius=7, popup='Center of Cluster ' + str(i), color=rainbow[i],
                            fill=True, fill_color=rainbow[i], fill_opacity=0.7).add_to(city_map)

In [311]:
map_San # map of san jose

In [312]:
map_York # map of new york

Now that the maps with the clusters and their centers have been shown, it is time to find which of all the clusters contains the greatest number of venues.

In [313]:
# to compare the clusters, first we can get the index of the cluster with the most venues and store it:
max_index_San = venues_San.groupby('labels')['name'].count().idxmax()
counts_San = venues_San.groupby('labels')['name'].count().rename('venues_per_cluster_San')
max_index_York = venues_York.groupby('labels')['name'].count().idxmax()
counts_York = venues_York.groupby('labels')['name'].count().rename('venues_per_cluster_York')

In [314]:
# then based on that, set the set of 'bigger_cluster' variables to that city and its data, to make later use easier
if counts_San[max_index_San] > counts_York[max_index_York]:
    bigger_cluster_city = 'San Jose'
    bigger_cluster_index = max_index_San
    bigger_cluster_counts = counts_San
    bigger_cluster_venues = venues_San
    bigger_cluster_kmeans = clus_San
    bigger_cluster_coords = coords_San
else:
    bigger_cluster_city = 'New York'
    bigger_cluster_index = max_index_York
    bigger_cluster_counts = counts_York
    bigger_cluster_venues = venues_York
    bigger_cluster_kmeans = clus_York
    bigger_cluster_coords = coords_York

In [315]:
print('The city with the biggest cluster is ' + bigger_cluster_city + ', and the cluster is Cluster ' + str(bigger_cluster_index))

The city with the biggest cluster is San Jose, and the cluster is Cluster 2


Now that we know the city with the bigger cluster, we know which city to use in the final analysis. The last step is to draw a final map showing only that cluster, and to print the coordinates of its center, so that the best location is clear.

In [316]:
# the final map
final_map = folium.Map(location=list(bigger_cluster_kmeans.cluster_centers_[bigger_cluster_index]), zoom_start=16)

# adding the one cluster
cluster = bigger_cluster_venues[bigger_cluster_venues['labels'] == bigger_cluster_index]

for lat, lon, labels, name in zip(cluster['lat'], cluster['lng'], cluster['labels'], cluster['name']):
    label = folium.Popup(str(name) + ': Cluster ' + str(labels), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=3,
        popup=label,
        color='black',
        fill=True,
        fill_color='white',
        fill_opacity=0.7).add_to(final_map)

# adding the final center of the cluster
folium.CircleMarker(bigger_cluster_kmeans.cluster_centers_[bigger_cluster_index], radius=7, popup='Center of Cluster ' +
                    str(bigger_cluster_index), color='black', fill=True, fill_color='white', fill_opacity=0.7).add_to(final_map)

# displaying the map
final_map

In [317]:
# printing out the coordinates for the location of the startup
print('The coordinates for the best location of the startup would be ' + str(list(bigger_cluster_kmeans.cluster_centers_[bigger_cluster_index])) + '.')

The coordinates for the best location of the startup would be [37.336067355833904, -121.89415825073274].


In [318]:
# our final answer: getting the location for the coordinates and printing out that as the best location for the startup
final_address = geolocator.reverse(bigger_cluster_kmeans.cluster_centers_[bigger_cluster_index])
print('The best location for the startup would be: ' + final_address[0])

The best location for the startup would be: The Old Spaghetti Factory, North San Pedro Street, San Pedro Square, Japantown, San Jose, Santa Clara County, California, 95113, United States of America
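As a rough check on the delivery-cost argument, the distance from the chosen center to individual venues can be estimated with the haversine (great-circle) formula. The sketch below uses the center coordinates printed above and three venues from the San Jose sample output earlier in this notebook. The helper `haversine_km` and the Earth radius constant are assumptions for illustration, and these three venues are not claimed to belong to the winning cluster.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometers between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Center of the largest cluster, as printed above
center = (37.336067355833904, -121.89415825073274)

# A few venues copied from the San Jose sample output
sample_venues = {
    "Ike's Sandwiches": (37.336852, -121.889436),
    'Back A Yard Caribbean American Grill': (37.336683, -121.892749),
    'Original Gravity Public House': (37.335052, -121.889747),
}

distances = {name: haversine_km(center[0], center[1], lat, lng)
             for name, (lat, lng) in sample_venues.items()}
mean_km = sum(distances.values()) / len(distances)

for name, d in distances.items():
    print('{}: {:.2f} km'.format(name, d))
print('Mean distance: {:.2f} km'.format(mean_km))
```

Running the same calculation over every venue in the winning cluster would give an average delivery distance that could be compared against other candidate sites.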


### Results

Based on the results of the k-means clustering and the final comparison, the best place for a startup supplying recyclable tableware is at or near The Old Spaghetti Factory on North San Pedro Street in San Jose, the address closest to the center of the largest cluster found in either city. Locating the startup there would shorten deliveries to the largest number of venues, which should lower delivery costs and increase profit.

Later adjustments could factor in other conditions, such as surveys of people who prefer green products, and the presence of brand-name venues, which would already have a supplier for their tableware.

### Conclusion

This project set out to locate the best area for a startup built around green, single-use tableware. Should anyone need this data, it is here for them. The project was motivated by concern for the environment and the hope that others will work toward improving it.