# Capstone Project - The Battle of the Neighborhoods - Where to Build a BBQ Restaurant in Manhattan, New York City?
### Applied Data Science Capstone

## Introduction: Business Problem <a name="introduction"></a>

In this project we will try to find an optimal location for a barbecue restaurant. Specifically, this report is targeted at stakeholders interested in opening a barbecue restaurant in **Manhattan**, New York.

Manhattan already has an abundance of restaurants and bars, including barbecue restaurants. To account for this competition, we will try to detect **locations that are not already crowded with restaurants**. We are also particularly interested in **areas with no barbecue or general American restaurants in the vicinity**.

We will use our data science expertise to identify a few of the most promising neighborhoods based on these criteria. The advantages of each area will then be clearly stated so that stakeholders can choose the best possible final location.

## Data <a name="data"></a>

Based on the definition of our problem, factors that will influence our decision are:
* number of existing barbecue restaurants in the neighborhood
* number of and distance to other similar restaurants in the neighborhood, if any
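
The second factor depends on distances between venues. One common way to estimate the distance between two latitude/longitude pairs is the haversine formula. The sketch below illustrates it under that assumption and is not a method taken from this notebook; the function name `haversine_m` and the 6,371 km mean Earth radius are choices made for this example. The coordinates are the Chinatown and Marble Hill neighborhood centers from the dataset used in this report.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed constant)

def haversine_m(lat1, lng1, lat2, lng2):
    """Approximate great-circle distance in metres between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Distance between the Chinatown and Marble Hill neighborhood centers
d = haversine_m(40.715618, -73.994279, 40.876551, -73.91066)
print(round(d / 1000, 1), 'km')
```

A helper like this could be applied to each pair of venue and candidate location to measure how close the nearest competitor is.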

We will use the following New York City dataset, which lists every borough and the neighborhoods within it along with their latitude and longitude coordinates.

https://geo.nyu.edu/catalog/nyu_2451_34572

The following data sources will be needed to extract/generate the required information:
* The number, type, and location of restaurants in every neighborhood will be obtained using the **Foursquare API**
* We will use the **geocoder from the geopy library** to obtain latitude and longitude information for locations of interest
* **Folium Maps** will be used to display choropleth maps of restaurant ratings in each borough

First, we install and import all of the dependencies needed to obtain our dataset.

In [20]:

import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes 
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes 
import folium # map rendering library

print('Libraries imported.')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.

Libraries imported.


#### We then enter in our Foursquare credentials

In [21]:
CLIENT_ID = '53CGMWQFD4N41RHRIHGJVUUBWJRFATUDXJSTKKPC20VHCKQH' # your Foursquare ID
CLIENT_SECRET = 'SMQP5O4LRDDMVSTRIBSINUWEEZTVPHJ0QB051A2PQTZPW52W' # your Foursquare Secret
VERSION = '20180604'
LIMIT = 150
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 53CGMWQFD4N41RHRIHGJVUUBWJRFATUDXJSTKKPC20VHCKQH
CLIENT_SECRET:SMQP5O4LRDDMVSTRIBSINUWEEZTVPHJ0QB051A2PQTZPW52W
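
Because the cell above embeds the keys in the notebook and prints them, anyone who can read the notebook can read the secret. A minimal alternative sketch reads the keys from environment variables and masks them when printing. The variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and the helpers `load_credentials` and `mask` are hypothetical, chosen only for this example, and the demo values are placeholders.

```python
import os

def load_credentials(env=os.environ):
    """Return (client_id, client_secret) from the environment, or raise if missing."""
    # Hypothetical variable names; use whatever names your own setup defines.
    try:
        return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET']
    except KeyError as missing:
        raise RuntimeError('Missing environment variable: {}'.format(missing)) from None

def mask(secret, visible=4):
    """Show only the first few characters so output never contains the full key."""
    return secret[:visible] + '*' * max(len(secret) - visible, 0)

# Demonstration with a stand-in mapping instead of the real environment
demo_env = {'FOURSQUARE_CLIENT_ID': 'EXAMPLEID123',
            'FOURSQUARE_CLIENT_SECRET': 'EXAMPLESECRET456'}
client_id, client_secret = load_credentials(demo_env)
print('CLIENT_ID: ' + mask(client_id))
print('CLIENT_SECRET: ' + mask(client_secret))
```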


#### We then download the aforementioned data set and examine an entry of it

In [22]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


In [24]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

#### All the relevant data is in the features key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data.

In [64]:
neighborhoods_data = newyork_data['features']

#### Now let's take a look at the first item in the list

In [65]:
neighborhoods_data[0]

{'type': 'Feature',
 'id': 'nyu_2451_34572.1',
 'geometry': {'type': 'Point',
  'coordinates': [-73.84720052054902, 40.89470517661]},
 'geometry_name': 'geom',
 'properties': {'name': 'Wakefield',
  'stacked': 1,
  'annoline1': 'Wakefield',
  'annoline2': None,
  'annoline3': None,
  'annoangle': 0.0,
  'borough': 'Bronx',
  'bbox': [-73.84720052054902,
   40.89470517661,
   -73.84720052054902,
   40.89470517661]}}

#### The next task is essentially transforming this data of nested Python dictionaries into a pandas dataframe. So let's start by creating an empty dataframe.

In [66]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

In [67]:
neighborhoods

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|


#### Then let's loop through the data, collect one row per neighborhood, and fill the dataframe

In [68]:
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# DataFrame.append was removed in pandas 2.0, so build the dataframe from the collected rows
neighborhoods = pd.DataFrame(rows, columns=column_names)

#### Let's take a look at the first 5 rows of our new dataframe

In [69]:
neighborhoods.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |
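
The same flattening of nested dictionaries can also be done without an explicit loop. The sketch below is one possible alternative, assuming pandas 1.0 or later (where `pd.json_normalize` is available); `features_to_frame` is a hypothetical helper, and the two sample features are abbreviated versions of dataset entries with rounded coordinates for the second one.

```python
import pandas as pd

# Abbreviated sample in the same shape as newyork_data['features']
sample_features = [
    {'type': 'Feature',
     'geometry': {'type': 'Point', 'coordinates': [-73.84720052054902, 40.89470517661]},
     'properties': {'name': 'Wakefield', 'borough': 'Bronx'}},
    {'type': 'Feature',
     'geometry': {'type': 'Point', 'coordinates': [-73.91066, 40.876551]},
     'properties': {'name': 'Marble Hill', 'borough': 'Manhattan'}},
]

def features_to_frame(features):
    """Flatten GeoJSON point features into Borough/Neighborhood/Latitude/Longitude."""
    # Nested keys become dotted column names such as 'properties.name'
    flat = pd.json_normalize(features)
    coords = flat['geometry.coordinates']  # each cell is a [longitude, latitude] list
    return pd.DataFrame({
        'Borough': flat['properties.borough'],
        'Neighborhood': flat['properties.name'],
        'Latitude': coords.apply(lambda c: c[1]),
        'Longitude': coords.apply(lambda c: c[0]),
    })

frame = features_to_frame(sample_features)
print(frame)
```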


In [31]:
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
        len(neighborhoods['Borough'].unique()),
        neighborhoods.shape[0]
    )
)

The dataframe has 5 boroughs and 306 neighborhoods.


#### We then use the geocoder within Python's geopy library to obtain the latitude and longitude coordinates of New York City

In [70]:
address = 'New York City, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.
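
geopy's `geocode` returns `None` when it cannot resolve an address, in which case `location.latitude` would raise an `AttributeError`. The sketch below wraps the lookup in a small helper that fails with a clearer message. `lookup_coordinates` is a hypothetical name, and the stand-in geocoder exists only so that the example runs without network access.

```python
from collections import namedtuple

# Stand-in for geopy's Location object, which exposes .latitude and .longitude
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude'])

def lookup_coordinates(geocode, address):
    """Call a geocode function and return (lat, lng), failing loudly on no match."""
    location = geocode(address)
    if location is None:
        raise ValueError('Could not geocode address: {!r}'.format(address))
    return location.latitude, location.longitude

def fake_geocode(address):
    """Offline replacement for geolocator.geocode, for illustration only."""
    known = {'New York City, NY': FakeLocation(40.7127281, -74.0060152)}
    return known.get(address)

latitude, longitude = lookup_coordinates(fake_geocode, 'New York City, NY')
print(latitude, longitude)
```

With geopy itself, `geolocator.geocode` would be passed as the first argument in place of `fake_geocode`.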


#### Let's generate a Folium Map of New York City with all of its neighborhoods superimposed on top

In [33]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# add markers to map
for lat, lng, borough, neighborhood in zip(neighborhoods['Latitude'], neighborhoods['Longitude'], neighborhoods['Borough'], neighborhoods['Neighborhood']):
    label = '{}, {}'.format(neighborhood, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

#### Now let's slice the original dataframe and create a new dataframe of the Manhattan data.

In [71]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |


#### We now see that the new Manhattan dataframe has 40 neighborhoods

In [72]:
manhattan_data.shape

(40, 4)

#### Now let's obtain the coordinates of Manhattan, the borough we will be analyzing

In [35]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.


#### We will now generate another Folium map with all of Manhattan's neighborhoods superimposed on it

In [36]:
map_manhattan = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, label in zip(manhattan_data['Latitude'], manhattan_data['Longitude'], manhattan_data['Neighborhood']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

#### Let's re-enter our Foursquare credentials

In [37]:
CLIENT_ID = 'B2SOBN12MQU3S50UDGCFKZTLMPMGMSAD53AZ5OJIT4FW42QQ' # your Foursquare ID
CLIENT_SECRET = 'GJLTIEY5MGSQCHILQHPK45VCDG3BOO4DOCHUPPQWP2FSMWAI' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 500 
radius = 5000 

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: B2SOBN12MQU3S50UDGCFKZTLMPMGMSAD53AZ5OJIT4FW42QQ
CLIENT_SECRET:GJLTIEY5MGSQCHILQHPK45VCDG3BOO4DOCHUPPQWP2FSMWAI


#### Now let's define a function that loops through a list of neighborhoods, queries Foursquare for venues near each one, and returns the results as a dataframe

In [73]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, categoryId=''):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
        
        # restrict the search to one Foursquare category when an ID is given
        if categoryId != '':
            url = url + '&categoryId={}'.format(categoryId)

        # make the GET request
        response = requests.get(url).json()
        results = response["response"]['venues']
        
        # keep only the relevant information for each venue that has a category
        for v in results:
            try:
                category = v['categories'][0]['name']
            except (KeyError, IndexError):
                # skip venues that come back without any category
                continue

            venues_list.append((
                name, 
                lat, 
                lng, 
                v['name'], 
                v['location']['lat'], 
                v['location']['lng'],
                category
            ))

    # one row per venue, with the neighborhood it was found near
    nearby_venues = pd.DataFrame(venues_list, columns=[
        'Neighborhood', 
        'Neighborhood Latitude', 
        'Neighborhood Longitude', 
        'Venue', 
        'Venue Latitude', 
        'Venue Longitude', 
        'Venue Category'])
    return nearby_venues
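
The request URL in `getNearbyVenues` is assembled by string formatting. As an alternative sketch, the same query string can be built from a dictionary with the standard library's `urlencode`, which also escapes special characters. `build_search_url` is a hypothetical helper, its default `limit` is arbitrary, and the credentials shown are placeholders.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SEARCH_ENDPOINT = 'https://api.foursquare.com/v2/venues/search'

def build_search_url(client_id, client_secret, version, lat, lng,
                     radius=500, limit=50, category_id=''):
    """Build a venue search URL; the category filter is added only when given."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    if category_id:
        params['categoryId'] = category_id
    return SEARCH_ENDPOINT + '?' + urlencode(params)

url = build_search_url('PLACEHOLDER_ID', 'PLACEHOLDER_SECRET', '20180605',
                       40.715618, -73.994279, radius=1000,
                       category_id='4bf58dd8d48988d1df931735')

# Parse the URL back to confirm which parameters were sent
query = parse_qs(urlsplit(url).query)
print(query['ll'], query['categoryId'])
```

`requests.get` also accepts such a dictionary directly through its `params` argument, which avoids building the URL by hand.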


#### Now let's create a new dataframe called 'bbq_venues' by calling our newly built function

In [45]:
bbq_venues = getNearbyVenues(names=manhattan_data['Neighborhood'], 
                                latitudes=manhattan_data['Latitude'], 
                                longitudes=manhattan_data['Longitude'], 
                                radius=1000, 
                                categoryId='4bf58dd8d48988d1df931735')

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards


#### Now let's create a new Folium map with Manhattan's neighborhoods and venues superimposed on it

In [47]:
for lat, lng, venue, venue_cat, neighborhood in zip(bbq_venues['Venue Latitude'], bbq_venues['Venue Longitude'], bbq_venues['Venue'], bbq_venues['Venue Category'], bbq_venues['Neighborhood']):
    label = '{}, {}, {}'.format(venue, venue_cat, neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_manhattan)  
    
map_manhattan

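#### Now let's check the size of the new dataframe: the search returned 698 venue records for Manhattan. Because the 1,000 m search areas of nearby neighborhoods overlap, the same venue can appear under more than one neighborhood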

In [48]:
print(bbq_venues.shape)
bbq_venues.head()

(698, 7)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Marble Hill | 40.876551 | -73.91066 | Sang's Barbecue Co | 40.868831 | -73.902634 | BBQ Joint |
| 1 | Chinatown | 40.715618 | -73.994279 | 友情客串bbq | 40.71755 | -73.99506 | Restaurant |
| 2 | Chinatown | 40.715618 | -73.994279 | Best Wingers | 40.717716 | -73.98514 | Wings Joint |
| 3 | Chinatown | 40.715618 | -73.994279 | HNH BBQ | 40.718364 | -74.000785 | BBQ Joint |
| 4 | Chinatown | 40.715618 | -73.994279 | K&S BBQ Ramen | 40.718332 | -73.991525 | BBQ Joint |
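
Since each neighborhood is searched with a 1,000 m radius, one venue may be returned for several neighborhoods, and the results include categories other than `BBQ Joint`. The sketch below shows one way to count distinct venues and keep a single category. It runs on a small hand-made sample rather than the real results: the first three rows come from the output above, and the last row is an invented duplicate added only for illustration.

```python
import pandas as pd

# Toy sample: the final row is an invented duplicate of HNH BBQ under another neighborhood
sample = pd.DataFrame(
    [
        ('Chinatown', 'Best Wingers', 40.717716, -73.98514, 'Wings Joint'),
        ('Chinatown', 'HNH BBQ', 40.718364, -74.000785, 'BBQ Joint'),
        ('Chinatown', 'K&S BBQ Ramen', 40.718332, -73.991525, 'BBQ Joint'),
        ('Little Italy', 'HNH BBQ', 40.718364, -74.000785, 'BBQ Joint'),
    ],
    columns=['Neighborhood', 'Venue', 'Venue Latitude', 'Venue Longitude', 'Venue Category'],
)

# Treat rows as the same place when the name and both coordinates match
unique_venues = sample.drop_duplicates(subset=['Venue', 'Venue Latitude', 'Venue Longitude'])

# Keep only venues labeled as BBQ joints
bbq_only = unique_venues[unique_venues['Venue Category'] == 'BBQ Joint']

print(len(sample), len(unique_venues), len(bbq_only))
```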


#### For now, let's condense the data to show the number of venues returned for each neighborhood

In [56]:
bbq_venues.groupby('Neighborhood')['Venue'].count().to_frame('# of Venues')

| Neighborhood | # of Venues |
|---|---|
| Battery Park City | 15 |
| Carnegie Hill | 10 |
| Central Harlem | 7 |
| Chelsea | 23 |
| Chinatown | 18 |
| Civic Center | 21 |
| Clinton | 28 |
| East Harlem | 5 |
| East Village | 18 |
| Financial District | 15 |
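
A `groupby` count only lists neighborhoods that returned at least one venue, yet neighborhoods with no venues are exactly the ones this analysis is looking for. One possible way to keep them is to reindex the counts against the full neighborhood list. The sketch uses a toy sample and hypothetical variable names; the assumption that the Inwood search returned nothing is made up for illustration.

```python
import pandas as pd

# Subset of manhattan_data['Neighborhood'], used here as the full list
all_neighborhoods = ['Chinatown', 'Marble Hill', 'Inwood']

# Toy venue table: 'Inwood' is assumed to have no results
venues = pd.DataFrame({
    'Neighborhood': ['Chinatown', 'Chinatown', 'Marble Hill'],
    'Venue': ['HNH BBQ', 'K&S BBQ Ramen', "Sang's Barbecue Co"],
})

counts = (
    venues.groupby('Neighborhood')['Venue']
    .count()
    .reindex(all_neighborhoods, fill_value=0)  # add missing neighborhoods with a count of 0
    .rename('# of Venues')
)
print(counts.sort_values())
```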


#### Now let's see how many unique venue categories the results contain

In [57]:
print('There are {} unique categories.'.format(len(bbq_venues['Venue Category'].unique())))

There are 23 unique categories.


#### Up next, in the Methodology section, we will cluster the neighborhoods to determine which one would be the best place to open a BBQ restaurant in Manhattan