# Capstone Project - The Battle of the Neighborhoods
### IBM Applied Data Science Capstone

## Opening Beverage Distribution Warehouses in the Staten Island, New York City

**By Dilip Mistry**

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)



## Introduction: Business Problem <a name="introduction"></a>

In this project, we will try to find some optimal locations for warehouse to deliver beverages to nearby restaurants, coffee shops and bars. Specifically, this report will be targeted to stakeholders interested in opening some beverage warehouses in **Staten Island** of New York City, USA.

The business problem of the project is that there are no central warehouse in the Staten Island and the distribution takes place from other boroughs of New York City. The distribution from other areas to the targeted areas of Staten Island became time consuming, inefficient and costly. Now, the client wants to open several warehouses to supply beverages to the restaurants, coffee shops and bars quickly and reduce the costs of the distribution of beverages. Therefore, the client wants to know which locations would be the best locations for his business in order to minimize the relative distance from each of the venues.

This project will use some data science techniques to generate neighborhoods in the Staten Island areas based on the criteria so that the client can choose the best possible five locations for warehouses and distribute beverages to the restaurants, coffee shops and bars with efficiently and cost effectively.

## Data <a name="data"></a>

This project needs the following data to solve the problem:
* List of neighborhoods with coordinate information in the Staten Island
* Venue data, particularly related to restaurants, coffee shops and bars

This project needs geographical data of Staten Island neighborhoods including its latitude and longitude information as well as location data describing places and venues. The data in this report obtained from https://geo.nyu.edu/catalog/nyu_2451_34572 and contained boroughs and list of neighborhoods in New York City. However, the list of neighborhoods in the Staten Island have been used in this report.The data also contains the geographical coordinates of the neighborhoods.

The information regarding restaurants, coffee shops and bars in every neighborhood of Staten Island will be obtained using **Foursquare API**. The Foursquare API will provide detailed location information, including name of location, categories for the location, geographical coordinates of the location, etc. This data will be used to determine the best 5 warehouses for distributing beverages in Staten Island, NY.

### Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analsysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # tranform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

### Download Dataset
Here is the link to the dataset: https://geo.nyu.edu/catalog/nyu_2451_34572
For the convenience, this dataset has been downloaded and placed it on the local drive. 

In [2]:
# load the data.
with open('nyu_2451_34572-geojson.json') as json_data:
    ny_data = json.load(json_data)

Let's take a quick look at the data.

In [3]:
ny_data

{'type': 'FeatureCollection',
 'totalFeatures': 306,
 'features': [{'type': 'Feature',
   'id': 'nyu_2451_34572.1',
   'geometry': {'type': 'Point',
    'coordinates': [-73.84720052054902, 40.89470517661]},
   'geometry_name': 'geom',
   'properties': {'name': 'Wakefield',
    'stacked': 1,
    'annoline1': 'Wakefield',
    'annoline2': None,
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.84720052054902,
     40.89470517661,
     -73.84720052054902,
     40.89470517661]}},
  {'type': 'Feature',
   'id': 'nyu_2451_34572.2',
   'geometry': {'type': 'Point',
    'coordinates': [-73.82993910812398, 40.87429419303012]},
   'geometry_name': 'geom',
   'properties': {'name': 'Co-op City',
    'stacked': 2,
    'annoline1': 'Co-op',
    'annoline2': 'City',
    'annoline3': None,
    'annoangle': 0.0,
    'borough': 'Bronx',
    'bbox': [-73.82993910812398,
     40.87429419303012,
     -73.82993910812398,
     40.87429419303012]}},
  {'type': 'Feature',
 

We see that the list of the neighborhoods data is in the features key. So, let's define a new variable that includes this data.

In [4]:
neighborhoods_data = ny_data['features']

Let's take a look at the first item in this list.

In [5]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

#### Tranform the data into a *pandas* dataframe

Create an empty dataframe.

In [6]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Then let's loop through the data and fill the dataframe one row at a time.

In [7]:
for data in neighborhoods_data:
    borough = neighborhood_name = data['properties']['borough'] 
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)

Examine the resulting dataframe.

In [8]:
neighborhoods.head()

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Bronx,Wakefield,40.894705,-73.847201
1,Bronx,Co-op City,40.874294,-73.829939
2,Bronx,Eastchester,40.887556,-73.827806
3,Bronx,Fieldston,40.895437,-73.905643
4,Bronx,Riverdale,40.890834,-73.912585


Filter neighborhood data and keep only **Staten Island** neighbourhood data

In [9]:
staten_data = neighborhoods[neighborhoods['Borough']=='Staten Island'].reset_index(drop=True)
staten_data

Unnamed: 0,Borough,Neighborhood,Latitude,Longitude
0,Staten Island,St. George,40.644982,-74.079353
1,Staten Island,New Brighton,40.640615,-74.087017
2,Staten Island,Stapleton,40.626928,-74.077902
3,Staten Island,Rosebank,40.615305,-74.069805
4,Staten Island,West Brighton,40.631879,-74.107182
5,Staten Island,Grymes Hill,40.624185,-74.087248
6,Staten Island,Todt Hill,40.597069,-74.111329
7,Staten Island,South Beach,40.580247,-74.079553
8,Staten Island,Port Richmond,40.633669,-74.129434
9,Staten Island,Mariner's Harbor,40.632546,-74.150085


#### Use geopy library to get the latitude and longitude values of Staten Island

Let's get the geographical coordinates of Staten Island.

In [10]:
address = 'Staten Island, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of Staten Island are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of Staten Island are 40.5834557, -74.1496048.


#### Create a map of Staten Island with neighborhoods superimposed on top.

In [11]:
# create map of New York using latitude and longitude values
map_staten = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(staten_data['Latitude'], staten_data['Longitude'], staten_data['Borough'], staten_data['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_staten)  
    
map_staten

Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

### Foursquare

Now that we have our location candidates, let's use Foursquare API to get info on restaurants, coffee shops, and bars in each neighborhood.

Foursquare credentials are defined bellow.

### Define Foursquare Credentials and Version

In [12]:
CLIENT_ID = 'XL24KM52GVRWBU5UIKYP0F4DBORXRBT0DSC01XII0UVDXAVP' # your Foursquare ID
CLIENT_SECRET = 'DEGIYUU1SRPD0WFRNRCHPO0BXQLLHZJRTN4I2KY1DJCGMFTE' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: XL24KM52GVRWBU5UIKYP0F4DBORXRBT0DSC01XII0UVDXAVP
CLIENT_SECRET:DEGIYUU1SRPD0WFRNRCHPO0BXQLLHZJRTN4I2KY1DJCGMFTE


### Explore Neighborhoods in Staten Island

Let's create a function to repeat the same process to all the neighborhoods in Staten Island

In [13]:
LIMIT = 100 # limit of number of venues returned by Foursquare API

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

Now write the code to run the above function on each neighborhood and create a new dataframe called staten_venues.

In [14]:
staten_venues = getNearbyVenues(names=staten_data['Neighborhood'],
                                   latitudes=staten_data['Latitude'],
                                   longitudes=staten_data['Longitude']
                                  )

St. George
New Brighton
Stapleton
Rosebank
West Brighton
Grymes Hill
Todt Hill
South Beach
Port Richmond
Mariner's Harbor
Port Ivory
Castleton Corners
New Springville
Travis
New Dorp
Oakwood
Great Kills
Eltingville
Annadale
Woodrow
Tottenville
Tompkinsville
Silver Lake
Sunnyside
Park Hill
Westerleigh
Graniteville
Arlington
Arrochar
Grasmere
Old Town
Dongan Hills
Midland Beach
Grant City
New Dorp Beach
Bay Terrace
Huguenot
Pleasant Plains
Butler Manor
Charleston
Rossville
Arden Heights
Greenridge
Heartland Village
Chelsea
Bloomfield
Bulls Head
Richmond Town
Shore Acres
Clifton
Concord
Emerson Hill
Randall Manor
Howland Hook
Elm Park
Manor Heights
Willowbrook
Sandy Ground
Egbertville
Prince's Bay
Lighthouse Hill
Richmond Valley
Fox Hills


In [15]:
staten_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,St. George,40.644982,-74.079353,Richmond County Bank Ballpark,40.645056,-74.076864,Baseball Stadium
1,St. George,40.644982,-74.079353,Staten Island September 11 Memorial,40.646767,-74.07651,Monument / Landmark
2,St. George,40.644982,-74.079353,Beso,40.643306,-74.076508,Tapas Restaurant
3,St. George,40.644982,-74.079353,A&S Pizzeria,40.64394,-74.077626,Pizza Place
4,St. George,40.644982,-74.079353,St. George Theatre,40.642253,-74.077496,Theater
5,St. George,40.644982,-74.079353,Enoteca Maria,40.641941,-74.07732,Italian Restaurant
6,St. George,40.644982,-74.079353,Steiny's Pub,40.642185,-74.076599,Bar
7,St. George,40.644982,-74.079353,Hypno-Tronic Comics,40.642476,-74.076587,Toy / Game Store
8,St. George,40.644982,-74.079353,The Gavel Grill,40.642157,-74.076674,American Restaurant
9,St. George,40.644982,-74.079353,Ruddy & Dean,40.644074,-74.076683,Bar


## Methodology <a name="methodology"></a>

In this project, we choose to use k-means algorithms in order to define the location of each warehouse. In first step, we have collected the required data such as location and type (category) of every restaurant, coffee shop, and bar in the Staten Island. In the second step, we will show all the locations in the map and merge all the locations of the restaurants, coffee shops and bars in a new dataframe and then find the central location and the standard deviation in location. In third and final step, we will focus on optimal locations and within those create **clusters of locations** established in discussion with the client.

By default, the **k-means clustering** algorithm minimize the distance of each point in a cluster from the centroid of the cluster. As a result, the cluster points of the k-means algorithm will be at the minimum distance of the determined centroid. We used the set of neighborhood locations (Latitude, Longitude) of the venues of interest. All venues belonging to restaurants, coffee shops, and bars are included in the dataset. We will present map of all such locations and will create clusters (using **k-means clustering**) of those locations to identify general neighborhoods and search for optimal warehouse locations for the client.

## Analysis <a name="analysis"></a>

### Find the coffee shop, restaurants, and bars in venues and create a new dataframe for each of them

Let's perform some basic explanatory data analysis and derive some additional info from our data.

### Restaurants

In [16]:
staten_restaurants = staten_venues[staten_venues['Venue Category'].str.contains('Restaurant') ==1]
staten_restaurants.head(10)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
2,St. George,40.644982,-74.079353,Beso,40.643306,-74.076508,Tapas Restaurant
5,St. George,40.644982,-74.079353,Enoteca Maria,40.641941,-74.07732,Italian Restaurant
8,St. George,40.644982,-74.079353,The Gavel Grill,40.642157,-74.076674,American Restaurant
14,St. George,40.644982,-74.079353,The Burrito Shoppe,40.643639,-74.077919,Fast Food Restaurant
18,St. George,40.644982,-74.079353,Not Guilty Deli,40.641867,-74.077016,American Restaurant
34,Stapleton,40.626928,-74.077902,Lakruwana,40.625654,-74.075174,Sri Lankan Restaurant
36,Stapleton,40.626928,-74.077902,Vida,40.628723,-74.079802,Restaurant
37,Stapleton,40.626928,-74.077902,Bay House Bistro,40.627827,-74.076244,Asian Restaurant
39,Stapleton,40.626928,-74.077902,Campo Bello Restaurante,40.624463,-74.079652,Spanish Restaurant
43,Stapleton,40.626928,-74.077902,El Patron Restaurant & Lounge Inc.,40.629154,-74.076541,Mexican Restaurant


Number of Restaurants

In [17]:
print('Total number of Restaurants:', len(staten_restaurants))

Total number of Restaurants: 160


### Coffee Shops

In [18]:
staten_coffee_shops = staten_venues[staten_venues['Venue Category']=='Coffee Shop']
staten_coffee_shops.head(10)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
90,West Brighton,40.631879,-74.107182,Fab Cup,40.630129,-74.109486,Coffee Shop
91,West Brighton,40.631879,-74.107182,Starbucks,40.63032,-74.106087,Coffee Shop
94,West Brighton,40.631879,-74.107182,Beans & Leaves,40.630718,-74.103376,Coffee Shop
175,New Springville,40.594252,-74.16496,Barnes & Noble Cafe,40.592385,-74.162901,Coffee Shop
264,Eltingville,40.542231,-74.164331,Country Donuts,40.545532,-74.160893,Coffee Shop
306,Woodrow,40.541968,-74.205246,Starbucks,40.542153,-74.207067,Coffee Shop
367,Park Hill,40.60919,-74.080157,Starbucks,40.606393,-74.078682,Coffee Shop
383,Arlington,40.635325,-74.165104,Unique Coffee Roasters,40.63737,-74.162424,Coffee Shop
589,Charleston,40.530531,-74.232158,Starbucks,40.52828,-74.233315,Coffee Shop
614,Arden Heights,40.549286,-74.185887,APlus at Sunoco,40.551233,-74.18403,Coffee Shop


Number of Coffee Shops

In [19]:
print('Total number of Coffee Shops:', len(staten_coffee_shops))

Total number of Coffee Shops: 12


### Bars

In [20]:
staten_bars = staten_venues[(staten_venues['Venue Category'].str.contains('Bar') ==1) & (staten_venues['Venue Category'].str.contains('Barbershop')==0)]
staten_bars.head(10)

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
6,St. George,40.644982,-74.079353,Steiny's Pub,40.642185,-74.076599,Bar
9,St. George,40.644982,-74.079353,Ruddy & Dean,40.644074,-74.076683,Bar
38,Stapleton,40.626928,-74.077902,The Hop Shoppe,40.629034,-74.079758,Beer Bar
40,Stapleton,40.626928,-74.077902,Vinum Wine Bar & Cafe,40.624853,-74.07489,Bar
100,West Brighton,40.631879,-74.107182,Jody's Club Forest,40.630894,-74.101912,Bar
107,West Brighton,40.631879,-74.107182,Better Gourmet Health Kitchen,40.630881,-74.102813,Juice Bar
109,West Brighton,40.631879,-74.107182,Liberty Tavern,40.630929,-74.10213,Bar
152,Port Ivory,40.639683,-74.174645,Jonesys Tavern,40.63964,-74.171252,Bar
165,Castleton Corners,40.613336,-74.119181,Danny Boy's,40.613212,-74.123846,Bar
178,New Springville,40.594252,-74.16496,1001 Nights Cafe,40.598126,-74.162204,Hookah Bar


Number of Bars

In [21]:
print('Total number of Bars:', len(staten_bars))

Total number of Bars: 24


This concludes the data gathering phase - we're now ready to use this data for analysis to produce the report on optimal locations for warehouses for beverages

### Mark the venues of interest in the map of Staten Island

In [22]:
address = 'Staten Island, NY'

geolocator = Nominatim()
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

map_staten_island = folium.Map(location=[latitude, longitude], zoom_start=12)

for lat, lng, neighborhood in zip(staten_coffee_shops['Neighborhood Latitude'], staten_coffee_shops['Neighborhood Longitude'], staten_coffee_shops['Neighborhood']):
    label = neighborhood
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7,
        parse_html=False).add_to(map_staten_island) 

for lat, lng, neighborhood in zip(staten_restaurants['Neighborhood Latitude'], staten_restaurants['Neighborhood Longitude'], staten_restaurants['Neighborhood']):
    label = neighborhood
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='red',
        fill=True,
        fill_color='red',
        fill_opacity=0.7,
        parse_html=False).add_to(map_staten_island)  
    
for lat, lng, neighborhood in zip(staten_bars['Neighborhood Latitude'], staten_bars['Neighborhood Longitude'], staten_bars['Neighborhood']):
    label = neighborhood
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='green',
        fill_opacity=0.7,
        parse_html=False).add_to(map_staten_island) 

map_staten_island

  This is separate from the ipykernel package so we can avoid doing imports until


* Concatanate the locations of the venues in a new dataframe
*  Find the central location and the standard deviation in location (x,y)

In [23]:
warehouse = pd.concat([staten_coffee_shops, staten_restaurants, staten_bars])
data =warehouse[['Neighborhood Latitude', 'Neighborhood Longitude']].values
x0 =np.mean(data[:,0])
y0=np.mean(data[:,1])

### Perform a k-means clustering to define the 5 clusters of venues 
- Use the location features as inputs

In [24]:
# set number of clusters
kclusters = 5
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=1).fit(data)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:] 

warehouse['Labels'] = kmeans.labels_

### Mark the venues again on the map of Staten Island
- Use different colors to indicate the clusters

In [25]:
map_SI = folium.Map(location=[latitude, longitude], zoom_start=12)
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for lat, lng, cluster in zip(warehouse['Neighborhood Latitude'], warehouse['Neighborhood Longitude'], warehouse['Labels']):
    label = cluster
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7,
        parse_html=False).add_to(map_SI)  
map_SI

### Function returning the mean radius of each cluster

In [26]:
def mean_radius(cl,loc):
    num=0
    sum_r = 0
    for x,y in zip(cl['Neighborhood Latitude'],cl['Neighborhood Longitude']):
        r= np.sqrt(((x-loc[0])*110)**2+((y-loc[1])*110*np.cos(x))**2)
        
        sum_r = sum_r+r
        num=num+1
    mean_r = sum_r/num
   
    return mean_r,num

### Find the number of venues and the mean radius before clustering¶

In [27]:
wh=warehouse
loc=[x0,y0]
r,num_all=mean_radius(wh,loc)
print(loc,r, num_all)

[40.59007903962348, -74.12469856234326] 5.671608797231721 196


### Find the number of venues and mean radius for each of the five clusters

In [28]:
locations =[]
radii=[]
num_per_cluster=[]
for i in range(kclusters):
    wh=warehouse[warehouse['Labels']==i]
    loc=kmeans.cluster_centers_[i]
    r,num=mean_radius(wh,loc)
    locations.append(loc)
    radii.append(r)
    num_per_cluster.append(num)
locations =np.array(locations)
raddi=np.array(radii)

### Calculate the density of Each Cluster

In [29]:
num_per_cluster

[77, 20, 38, 32, 29]

Find the Latitude and Longitude of the locations

In [30]:
locations

array([[ 40.61999986, -74.08230212],
       [ 40.53041527, -74.21610762],
       [ 40.57688823, -74.10999616],
       [ 40.61142281, -74.15570007],
       [ 40.54551428, -74.15928435]])

In [31]:
3*np.array(radii)

array([6.21416501, 5.88775118, 4.1265604 , 5.93557725, 3.61063342])

In [32]:
area = (np.pi*(3*np.array(radii))**2)
area

array([121.31526042, 108.90523813,  53.49661281, 110.6816937 ,
        40.95591821])

In [33]:
(np.array(num_per_cluster))/area

array([0.63470993, 0.1836459 , 0.71032535, 0.28911737, 0.70807837])

### Mark each cluster on the map -Use variable radius for each cluster to indicate its radius

Let us now **cluster** those locations to create **centers of zones containing good locations**. Those zones, their centers and addresses will be the final result of our analysis. 

In [34]:
x = np.arange(kclusters)
ys = [i+x+(i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

map_SI = folium.Map(location=[latitude, longitude], zoom_start=12)
for lat, lng, r, lbl in zip(locations[:,0], locations[:,1], radii, range(kclusters)):
    label = lbl
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=r*50,
        popup=label,
        color=rainbow[lbl-1],
        fill=True,
        fill_color=rainbow[lbl-1],
        fill_opacity=0.7,
        parse_html=False).add_to(map_SI) 
    print(lat,lng,r)
map_SI

40.61999986228201 -74.08230212000264 2.0713883357340106
40.53041526787677 -74.21610762052721 1.9625837265770283
40.576888232852255 -74.10999615933883 1.375520133322211
40.611422813016034 -74.15570007277688 1.9785257509078122
40.545514280967396 -74.1592843511998 1.2035444719507211


This concludes our analysis. In order to determine the radius of each cluster, descriptive statistics were used and thus, the area in which the venues of interest will be served by the specific warehouse. Moreover, in order to examine the result, the density of the venues in each area was calculated. Mean values were used to determine the exact location of each warehouse. This project created 5 zones representing their centers and addresses will be the final result. Although zones are shown on map, their shape is actually very irregular and their centers should be considered only as a starting point for exploring area neighborhoods in search for potential warehouse locations.

## Results and Discussion <a name="results"></a>

The location of the warehouse was identified by the centroids of the clusters calculated by the k-means clustering algorithm. The addresses can be extracted from the location’s latitude and longitude values. The area of each cluster was measured by calculating the mean radius of each cluster. Finally, the density of each cluster was calculated from the number of venues based on the total area of each cluster. The results are shown in the table below.

| Cluster Number | Color of Cluster | Latitude | Longitude | Mean Radius (KM) | Area (KM2) | Number_Venues | Density |
|----------------|------------------|----------|-----------|------------------|------------|---------------|-------- |
| Warehouse 1    | Red              | 40.6199  | -74.0823  | 2.07             | 121.31     | 77            | 0.63    |
| Warehouse 2    | Purple           | 40.5304  | -74.2161  | 1.96             | 108.91     | 20            | 0.18    |
| Warehouse 3    | Blue             | 40.5768  | -74.1099  | 1.38             | 53.50      | 38            | 0.71    |
| Warehouse 4    | Green            | 40.6114  | -74.1557  | 1.98             | 110.68     | 32            | 0.29    |
| Warehouse 5    | Orange           | 40.5455  | -74.1592  | 1.20             | 40.96      | 29            | 0.71    |

The clustering map shows the distribution of restaurants, coffee shops, and bars in the Staten Island area in five clustering zones. The red, purple and green clusters hold the larger area ranging from 109~121 sq km. Whereas, the blue and orange clusters hold smaller area ranging from 41~54 sq km. The red cluster contains the largest number of restaurants, coffee shops, and bars than each of other four clusters. Even though, green and purple clusters hold the larger area as red cluster, they contain relatively less number of restaurants, coffee shops, and bars than the red cluster. 

**Based on the observation, we suggest the following:**
* The number of trucks, employees, and drivers should be chosen based on the number of venues in each cluster
* The size of the warehouse in the red clustering zone would be bigger than any other warehouses
* The number of trucks, employees, and drivers in red clustering zone would be at least double than the number of any warehouses  

## Conclusion <a name="conclusion"></a>

The purpose of this project was to locate five warehouses for the client in the Staten Island area in order to supply beverages to the restaurants, coffee shops and bars more efficiently and cost effectively. The approach was to divide the Staten Island area into five sub regions based on the criteria of the distribution of the restaurants, coffee shops and bars in the entire Staten Island area. The centroid of each cluster was identified by running the k-means clustering algorithm. In order to estimate the clustering area containing the restaurants, coffee shops, and bars to which the warehouse would provide beverages, a standard deviation of each cluster was used. Based on the results, this project was able to locate the location of the warehouses, their size, and the number of employees and truck drivers in each warehouse needed. 

Even though, this project recommended the optimal locations for warehouses, final decision on optimal location will be made by the client based on specific characteristics of neighborhoods taking into consideration additional factors like proximity to major roads, real estate availability, prices, social and economic dynamics of every neighborhood etc.