# Capstone Project - Best Place to open an Indian Restaurant in New York


## Introduction



An entrepreneur based in New York wants to open an Indian restaurant. The premise of this project is that New York may not have enough Indian restaurants, which would make this a promising opportunity. Because Indian food is similar to other Asian cuisines, the entrepreneur is considering locations where Asian food is already popular. Choosing the location is one of the most important decisions for a new restaurant, so I am building a model to help find the best place to open one in New York.


## Business Problem
In New York, if someone wants to open an Indian restaurant, where should they consider opening it?

## Data
The model requires the following data:

* List of neighborhoods in New York.
* Latitude and Longitude of these neighborhoods.
* Venue data for Indian and Asian restaurants. This will help us find the locations most suitable for opening an Indian restaurant.
 
 
The following data sources are used to extract or generate the required information:

* The neighborhoods in New York and their latitude and longitude coordinates can be downloaded from https://geo.nyu.edu/catalog/nyu_2451_34572.
* The geopy library provides the latitude and longitude values of New York City.
* The number, type, and location of restaurants in every neighborhood are obtained using the Foursquare API.


In [1]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

!conda install -c conda-forge geopy --yes # comment out this line if geopy is already installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes # comment out this line if folium is already installed
import folium # map rendering library

print('Libraries imported.')


In [2]:
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
print('Data downloaded!')

Data downloaded!


Load and explore the data

In [3]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

In [4]:
neighborhoods_data = newyork_data['features']

#### Transform the data into a *pandas* dataframe

The next task is essentially transforming this data of nested Python dictionaries into a *pandas* dataframe. So let's start by creating an empty dataframe.

In [5]:
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude'] 

# instantiate the dataframe
neighborhoods = pd.DataFrame(columns=column_names)

Then let's loop through the data and fill the dataframe one row at a time.

In [6]:
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']
        
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    
    neighborhoods = neighborhoods.append({'Borough': borough,
                                          'Neighborhood': neighborhood_name,
                                          'Latitude': neighborhood_lat,
                                          'Longitude': neighborhood_lon}, ignore_index=True)
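`DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above only runs on older pandas releases. A minimal sketch of an equivalent approach collects the rows in a list first and builds the dataframe once. The helper name `features_to_dataframe` and the sample feature are hypothetical; the sketch assumes the same nested structure as `newyork_data['features']`.

```python
import pandas as pd

def features_to_dataframe(features):
    """Collect one dict per feature, then build the dataframe in a single call."""
    rows = []
    for feature in features:
        # coordinates are stored as [longitude, latitude]
        lon, lat = feature['geometry']['coordinates'][:2]
        rows.append({
            'Borough': feature['properties']['borough'],
            'Neighborhood': feature['properties']['name'],
            'Latitude': lat,
            'Longitude': lon,
        })
    return pd.DataFrame(rows, columns=['Borough', 'Neighborhood', 'Latitude', 'Longitude'])

# Hypothetical sample shaped like one entry of newyork_data['features'].
sample_features = [{
    'properties': {'borough': 'Manhattan', 'name': 'Marble Hill'},
    'geometry': {'coordinates': [-73.91065965862981, 40.87655077879964]},
}]
sample_df = features_to_dataframe(sample_features)
print(sample_df)
```

In the notebook this would be called as `features_to_dataframe(neighborhoods_data)`. Building the frame once also avoids copying the whole dataframe on every iteration.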

Next, let's slice the original dataframe to create a new dataframe containing only the Manhattan data.

In [7]:
manhattan_data = neighborhoods[neighborhoods['Borough'] == 'Manhattan'].reset_index(drop=True)
manhattan_data.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Manhattan | Marble Hill | 40.876551 | -73.91066 |
| 1 | Manhattan | Chinatown | 40.715618 | -73.994279 |
| 2 | Manhattan | Washington Heights | 40.851903 | -73.9369 |
| 3 | Manhattan | Inwood | 40.867684 | -73.92121 |
| 4 | Manhattan | Hamilton Heights | 40.823604 | -73.949688 |


#### Use the geopy library to get the latitude and longitude values of Manhattan.

In order to define an instance of the geocoder, we need to define a user_agent. We will name our agent *ny_explorer*, as shown below.

In [8]:
address = 'Manhattan, NY'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Manhattan are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Manhattan are 40.7896239, -73.9598939.


Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.

In [9]:
# The code was removed by Watson Studio for sharing.
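The removed cell defined the Foursquare credentials that later cells rely on (`CLIENT_ID`, `CLIENT_SECRET` and `VERSION`). One way to define them without putting secrets in the notebook is to read them from environment variables. This is only a sketch: the helper name, the environment variable names and the default version date are assumptions, not the original code. The `v` parameter of the v2 API is a date string in `YYYYMMDD` form.

```python
import os

def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret, version) read from a mapping of environment variables."""
    required = ('FOURSQUARE_CLIENT_ID', 'FOURSQUARE_CLIENT_SECRET')  # hypothetical variable names
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    # the default below is an arbitrary placeholder date, not a recommended value
    version = env.get('FOURSQUARE_API_VERSION', '20180605')
    return env['FOURSQUARE_CLIENT_ID'], env['FOURSQUARE_CLIENT_SECRET'], version

# Example with a plain dict standing in for os.environ (dummy values).
CLIENT_ID, CLIENT_SECRET, VERSION = load_foursquare_credentials({
    'FOURSQUARE_CLIENT_ID': 'dummy-id',
    'FOURSQUARE_CLIENT_SECRET': 'dummy-secret',
})
```

In a real run the function would be called without arguments so that it reads `os.environ`.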

#### Let's explore the first neighborhood in our dataframe.

Get the neighborhood's name.

In [10]:
manhattan_data.loc[0, 'Neighborhood']

'Marble Hill'

Get the neighborhood's latitude and longitude values.

In [11]:
neighborhood_latitude = manhattan_data.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = manhattan_data.loc[0, 'Longitude'] # neighborhood longitude value

neighborhood_name = manhattan_data.loc[0, 'Neighborhood'] # neighborhood name

print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name, 
                                                               neighborhood_latitude, 
                                                               neighborhood_longitude))

Latitude and longitude values of Marble Hill are 40.87655077879964, -73.91065965862981.


Let's set the limit and radius for the Foursquare API.

In [12]:
LIMIT = 100
radius = 500

#### Let's create a function to repeat the same process for all the neighborhoods in Manhattan

In [13]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name'],
            v['venue']['id']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category',
                             'Id']
    
    return(nearby_venues)
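The list comprehension in `getNearbyVenues` assumes every venue has at least one category, so `v['venue']['categories'][0]` would raise an `IndexError` if the API ever returned a venue with an empty category list. A minimal sketch of a more defensive flattening step follows. The helper name `venue_rows` and the sample items are hypothetical; the field layout mirrors the fields accessed in the function above.

```python
def venue_rows(name, lat, lng, items):
    """Flatten explore-style items into tuples, tolerating venues without categories."""
    rows = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories') or []
        category = categories[0]['name'] if categories else None  # None when no category is listed
        rows.append((name, lat, lng,
                     venue['name'],
                     venue['location']['lat'],
                     venue['location']['lng'],
                     category,
                     venue['id']))
    return rows

# Hypothetical items shaped like response['groups'][0]['items'].
sample_items = [
    {'venue': {'id': 'a1', 'name': 'Curry House',
               'location': {'lat': 40.80, 'lng': -73.96},
               'categories': [{'name': 'Indian Restaurant'}]}},
    {'venue': {'id': 'b2', 'name': 'Unlabeled Spot',
               'location': {'lat': 40.81, 'lng': -73.97},
               'categories': []}},
]
rows = venue_rows('Manhattan Valley', 40.797, -73.964, sample_items)
print(rows)
```

Rows with a `None` category are simply skipped later by the `isin` filter on `Venue Category`.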

#### Now run the above function on each neighborhood and create a new dataframe called *manhattan_venues*.

In [14]:

manhattan_venues = getNearbyVenues(names=manhattan_data['Neighborhood'],
                                   latitudes=manhattan_data['Latitude'],
                                   longitudes=manhattan_data['Longitude']
                                  )

Marble Hill
Chinatown
Washington Heights
Inwood
Hamilton Heights
Manhattanville
Central Harlem
East Harlem
Upper East Side
Yorkville
Lenox Hill
Roosevelt Island
Upper West Side
Lincoln Square
Clinton
Midtown
Murray Hill
Chelsea
Greenwich Village
East Village
Lower East Side
Tribeca
Little Italy
Soho
West Village
Manhattan Valley
Morningside Heights
Gramercy
Battery Park City
Financial District
Carnegie Hill
Noho
Civic Center
Midtown South
Sutton Place
Turtle Bay
Tudor City
Stuyvesant Town
Flatiron
Hudson Yards


#### Let's check the size of the resulting dataframe

In [15]:
print(manhattan_venues.shape)
manhattan_venues.head()

(3327, 8)


| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category | Id |
|---|---|---|---|---|---|---|---|---|
| 0 | Marble Hill | 40.876551 | -73.91066 | Arturo's | 40.874412 | -73.910271 | Pizza Place | 4b4429abf964a52037f225e3 |
| 1 | Marble Hill | 40.876551 | -73.91066 | Bikram Yoga | 40.876844 | -73.906204 | Yoga Studio | 4baf59e8f964a520a6f93be3 |
| 2 | Marble Hill | 40.876551 | -73.91066 | Tibbett Diner | 40.880404 | -73.908937 | Diner | 4b79cc46f964a520c5122fe3 |
| 3 | Marble Hill | 40.876551 | -73.91066 | Starbucks | 40.877531 | -73.905582 | Coffee Shop | 55f81cd2498ee903149fcc64 |
| 4 | Marble Hill | 40.876551 | -73.91066 | Dunkin' | 40.877136 | -73.906666 | Donut Shop | 4b5357adf964a520319827e3 |


Next, let's slice the venues dataframe to create a new dataframe containing only the Indian and Asian restaurants.

In [16]:
category = ['Indian Restaurant','Asian Restaurant']
manhattan_indian_restaurant = manhattan_venues[manhattan_venues['Venue Category'].isin(category)]


#### Now I will create a map of New York with the Indian and Asian restaurants superimposed on top.

In [17]:
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=12)

# add markers to map
for lat, lng, Venue, neighborhood in zip(manhattan_indian_restaurant['Neighborhood Latitude'], manhattan_indian_restaurant['Neighborhood Longitude'], manhattan_indian_restaurant['Venue'], manhattan_indian_restaurant['Neighborhood']):
    label = '{}, {}'.format(neighborhood, Venue)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_newyork)  
    
map_newyork

#### Let's create a function to get the ratings of all the Indian and Asian restaurants in New York

In [26]:
def getVenueRating(venue_id):
    
    # create the API request URL
    url = 'https://api.foursquare.com/v2/venues/{}?&client_id={}&client_secret={}&v={}'.format(
        venue_id,
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION)
            
    # make the GET request
    result = requests.get(url).json()
    
    # fall back to 'NA' when the response is empty or the venue has no rating
    venue = result.get("response", {}).get("venue", {})
    rating = venue.get("rating", 'NA')
        
    return rating

Now I will apply the above function to each restaurant and add a new column called *Rating* to the dataframe.

In [27]:
# work on an explicit copy so that adding a column does not trigger SettingWithCopyWarning
manhattan_indian_restaurant = manhattan_indian_restaurant.copy()
manhattan_indian_restaurant['Rating'] = manhattan_indian_restaurant['Id'].apply(getVenueRating)


In [20]:
manhattan_indian_restaurant[['Rating','Id']]

| | Rating | Id |
|---|---|---|
| 54 | 8.5 | 4b89ac72f964a520e94a32e3 |
| 66 | 9.2 | 574f2ae0498ec2a700651b95 |
| 149 | 7.8 | 4ae7876ef964a5201eac21e3 |
| 294 | 7.6 | 54c2bd96498eaf5142e3fe92 |
| 312 | 7.4 | 5914ff32b23dfa207eca38de |
| 338 | 8.0 | 529d382a11d2dd5ef107e641 |
| 565 | 8.3 | 579fb23f498e7f9f1b7bd04c |
| 822 | 7.7 | 4b0dec08f964a520ae5223e3 |
| 850 | 8.4 | 591890f43abcaf1ddca66e85 |
| 863 | 8.3 | 42489a80f964a5208b201fe3 |
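`getVenueRating` returns the string `'NA'` when no rating is available, while k-means needs purely numeric input. A minimal sketch of one way to clean the column before clustering is shown here with hypothetical sample values. It is a suggested preprocessing step, not part of the original notebook.

```python
import pandas as pd

# Hypothetical ratings, including one venue without a rating.
ratings = pd.DataFrame({
    'Id': ['a1', 'b2', 'c3'],
    'Rating': [8.5, 'NA', 7.4],
})

# Turn non-numeric entries such as 'NA' into NaN, then drop those rows.
ratings['Rating'] = pd.to_numeric(ratings['Rating'], errors='coerce')
rated = ratings.dropna(subset=['Rating']).reset_index(drop=True)
print(rated)
```

Dropping unrated venues is one option among several. Whether it is appropriate depends on how many venues lack a rating.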


## Cluster the Restaurants by Rating

Run *k*-means to group the restaurants into 3 clusters based on their rating.

In [21]:
# set number of clusters
kclusters = 3

manhattan_indian_restaurant_clustering = manhattan_indian_restaurant['Rating']


# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(manhattan_indian_restaurant_clustering.values.reshape(-1,1))

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 1, 2, 2, 2, 2, 1, 2, 1, 1], dtype=int32)
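The label numbers that k-means assigns are arbitrary: label 0 is not necessarily the lowest-rated group. A small sketch of how the labels could be renumbered so that they follow the order of the cluster centers is given below. The helper name `relabel_by_center` and the sample centers are hypothetical; in the notebook the inputs would be `kmeans.labels_` and `kmeans.cluster_centers_`.

```python
import numpy as np

def relabel_by_center(labels, centers):
    """Renumber labels so that 0 is the cluster with the smallest center, 1 the next, and so on."""
    order = np.argsort(np.asarray(centers).ravel())  # old labels sorted by center value
    rank = np.empty_like(order)
    rank[order] = np.arange(len(order))              # rank[old_label] gives the new label
    return rank[np.asarray(labels)]

# Hypothetical labels and one-dimensional centers (mean rating per cluster).
labels = np.array([1, 1, 2, 2, 0])
centers = np.array([[6.5], [8.6], [7.7]])
ordered = relabel_by_center(labels, centers)
print(ordered)
```

With ordered labels, a higher cluster number always means a higher average rating, which makes the cluster descriptions easier to check against the data.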

Let's insert the cluster labels as a new column in the manhattan_indian_restaurant dataframe, so that each restaurant carries its cluster label.

In [22]:
manhattan_indian_restaurant.insert(0, 'Cluster Labels', kmeans.labels_)


Finally, let's visualize the resulting clusters

In [23]:
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(manhattan_indian_restaurant['Neighborhood Latitude'], manhattan_indian_restaurant['Neighborhood Longitude'], manhattan_indian_restaurant['Neighborhood'], manhattan_indian_restaurant['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

## Result

The results from k-means clustering show that we can group the Indian and Asian restaurants in Manhattan into 3 clusters based on their Foursquare rating. Going by the cluster labels and ratings printed above:
* Cluster 0: Restaurants with a rating less than 7
* Cluster 1: Restaurants with a rating greater than 8
* Cluster 2: Restaurants with a rating between 7 and 8

The results are visualized in the map above. Because the markers are colored with `rainbow[cluster-1]`, Cluster 0 is drawn in red, Cluster 1 in violet and Cluster 2 in light green.



## Recommendations
Most of the Indian and Asian restaurants fall into the two lower-rated clusters (ratings of 8 or below), and those around the Manhattan Valley area have the lowest ratings on Foursquare. This suggests a good opportunity to open near this area, because the existing competition appears weak. Therefore, this project recommends opening an authentic Indian restaurant in these locations, where there is little to no strong competition. Nonetheless, if the food is authentic, affordable and tasty, I am confident that the restaurant will attract a strong following anywhere.


## Conclusion



In this project, we identified the business problem, specified the required data, extracted and prepared the data, applied k-means clustering, and provided a recommendation to the stakeholder.
