# Data Science Capstone Project

## Introduction/Business Problem

An enthusiastic young chef plans to open a new high-end Mexican restaurant in New York City and wants to find the best location for it. To choose that location, the chef wants answers to the following questions.
* Are there any similar restaurants already open in New York City, and if so, where are they?
* What kinds of neighborhoods are the existing Mexican restaurants in, and what neighborhoods are most similar?
* Are there any neighborhoods that are similar to those with existing Mexican restaurants that don't currently have a Mexican restaurant?

## Data

We will use the following sources of data for this project.

* Foursquare venue data via the Foursquare API
* New York City neighborhood data from NYU, as used in the neighborhood project earlier in the capstone course
* Maps from the folium library in Python
* Location data from the geopy library

We will load in all of the necessary libraries before gathering data.

In [2]:
# Import necessary libraries
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# JSON data handling libraries
import json
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated

# Mapping package
import folium

# Library to handle requests for the foursquare api
import requests

# Library to look up location data
from geopy.geocoders import Nominatim

# K-means clustering package from scikit-learn
from sklearn.cluster import KMeans

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

### Neighborhood data

First, we download the New York neighborhood data provided at https://geo.nyu.edu/catalog/nyu_2451_34572. The data will be accessed via the same server that we used in the earlier project.

In [3]:
!curl -q -o 'newyork_data.json' https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DP0701EN/data/nyu_2451_34572-geojson.json
print('Data downloaded!')

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  113k  100  113k    0     0   334k      0 --:--:-- --:--:-- --:--:--  335k
Data downloaded!


Next, we convert the JSON file into a pandas dataframe that is easier to work with.

In [7]:
with open('newyork_data.json') as json_data:
    newyork_data = json.load(json_data)

# All of the data we need is in the features key
neighborhoods_data = newyork_data['features']

# Define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# Collect one row per neighborhood from the json file
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON stores coordinates as [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# Build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
neighborhoods = pd.DataFrame(rows, columns=column_names)

# View the head of the new dataframe
neighborhoods.head()

| | Borough | Neighborhood | Latitude | Longitude |
|---|---|---|---|---|
| 0 | Bronx | Wakefield | 40.894705 | -73.847201 |
| 1 | Bronx | Co-op City | 40.874294 | -73.829939 |
| 2 | Bronx | Eastchester | 40.887556 | -73.827806 |
| 3 | Bronx | Fieldston | 40.895437 | -73.905643 |
| 4 | Bronx | Riverdale | 40.890834 | -73.912585 |
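
As an alternative to the explicit loop, the same table can be built by flattening the GeoJSON features with `pd.json_normalize`. The sketch below is illustrative only: `sample_features` is a hypothetical two-entry stand-in for `newyork_data['features']`, and the helper name `features_to_frame` is not part of the original notebook. It assumes every feature has a two-element `[longitude, latitude]` coordinate list.

```python
import pandas as pd

# Hypothetical stand-in for newyork_data['features'] (two entries only)
sample_features = [
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-73.847201, 40.894705]},
     "properties": {"name": "Wakefield", "borough": "Bronx"}},
    {"type": "Feature",
     "geometry": {"type": "Point", "coordinates": [-73.829939, 40.874294]},
     "properties": {"name": "Co-op City", "borough": "Bronx"}},
]


def features_to_frame(features):
    """Flatten GeoJSON point features into the four columns used above."""
    # Nested keys become dotted column names such as 'properties.name'
    flat = pd.json_normalize(features)
    # Each coordinate list is [longitude, latitude]; split it into two columns
    coords = pd.DataFrame(flat["geometry.coordinates"].tolist(),
                          columns=["Longitude", "Latitude"])
    return pd.DataFrame({
        "Borough": flat["properties.borough"],
        "Neighborhood": flat["properties.name"],
        "Latitude": coords["Latitude"],
        "Longitude": coords["Longitude"],
    })


sample_neighborhoods = features_to_frame(sample_features)
```

Flattening in one call avoids growing a dataframe row by row, which matters little for a few hundred neighborhoods but keeps the cell short.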


### Geopy location data

We will use geopy to find the latitude and longitude of New York City.

In [8]:
# Define the location we are interested in
address = 'New York City, NY'

# Look up and save lat/long data for NYC
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))

The geographical coordinates of New York City are 40.7127281, -74.0060152.
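
geopy's `geocode` returns `None` when an address has no match, in which case reading `.latitude` raises an `AttributeError`. A minimal sketch of a guard is shown below. The helper name `lookup_coordinates` is hypothetical, and it accepts any callable shaped like `geolocator.geocode` so that it can be tried without network access.

```python
def lookup_coordinates(geocode, address):
    """Return (latitude, longitude) for address, or raise if nothing matched.

    `geocode` is any callable shaped like geopy's `geolocator.geocode`: it
    takes an address string and returns an object with `.latitude` and
    `.longitude` attributes, or None when there is no match.
    """
    location = geocode(address)
    if location is None:
        raise ValueError(f"No geocoding result for {address!r}")
    return location.latitude, location.longitude


# Usage with the geolocator defined above (needs network access):
# latitude, longitude = lookup_coordinates(geolocator.geocode, 'New York City, NY')
```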


### Folium maps

We will use folium to create informative maps to help the client make a decision. Here is an example map of NYC.

In [29]:
# Create a map of NYC using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)

# Display the map
map_newyork

### Foursquare data

To use the Foursquare API, we need a client ID and secret. These credentials are deleted before the notebook is published.

In [30]:
CLIENT_ID = '' # your Foursquare ID
CLIENT_SECRET = '' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

We will use the following function to gather venue data for the neighborhoods of NYC.

In [16]:
def getNearbyVenues(names, latitudes, longitudes, radius=500, limit=100):
    # limit caps the number of venues returned per request; LIMIT was not
    # defined in the cells shown here, so the default of 100 is an assumption
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            limit)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues
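
The parsing step inside the function can be tried without calling the API. The sketch below mirrors the nesting the function reads (`response` → `groups[0]` → `items` → `venue`) on a hand-written payload. The helper name `extract_venues` and `sample_payload` are hypothetical, and the handling of venues with no category is an assumption about what the API may return rather than documented behavior.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into rows of seven fields."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0]["items"] if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        # Assumption: a venue may come back with an empty category list
        categories = venue.get("categories") or []
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng,
                     venue["name"],
                     venue["location"]["lat"],
                     venue["location"]["lng"],
                     category))
    return rows


# Hypothetical payload that imitates the structure read by getNearbyVenues
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Lollipops Gelato",
                   "location": {"lat": 40.894123, "lng": -73.845892},
                   "categories": [{"name": "Dessert Shop"}]}},
        {"venue": {"name": "Uncategorized Venue",
                   "location": {"lat": 40.895, "lng": -73.846},
                   "categories": []}},
    ]}]}
}

sample_rows = extract_venues(sample_payload, "Wakefield", 40.894705, -73.847201)
```

Returning `None` for a missing category keeps the row in the table; filtering later with `.str.contains(..., na=False)` would then skip it without raising.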

In [22]:
nyc_venues = getNearbyVenues(neighborhoods['Neighborhood'],
                                   latitudes=neighborhoods['Latitude'],
                                   longitudes=neighborhoods['Longitude']
                                  )

Wakefield
Co-op City
Eastchester
Fieldston
Riverdale
Kingsbridge
Marble Hill
Woodlawn
Norwood
Williamsbridge
Baychester
Pelham Parkway
City Island
Bedford Park
University Heights
Morris Heights
Fordham
East Tremont
West Farms
High  Bridge
Melrose
Mott Haven
Port Morris
Longwood
Hunts Point
Morrisania
Soundview
Clason Point
Throgs Neck
Country Club
Parkchester
Westchester Square
Van Nest
Morris Park
Belmont
Spuyten Duyvil
North Riverdale
Pelham Bay
Schuylerville
Edgewater Park
Castle Hill
Olinville
Pelham Gardens
Concourse
Unionport
Edenwald
Bay Ridge
Bensonhurst
Sunset Park
Greenpoint
Gravesend
Brighton Beach
Sheepshead Bay
Manhattan Terrace
Flatbush
Crown Heights
East Flatbush
Kensington
Windsor Terrace
Prospect Heights
Brownsville
Williamsburg
Bushwick
Bedford Stuyvesant
Brooklyn Heights
Cobble Hill
Carroll Gardens
Red Hook
Gowanus
Fort Greene
Park Slope
Cypress Hills
East New York
Starrett City
Canarsie
Flatlands
Mill Island
Manhattan Beach
Coney Island
Bath Beach
Borough Park
Dyker

In [23]:
nyc_venues.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Wakefield | 40.894705 | -73.847201 | Lollipops Gelato | 40.894123 | -73.845892 | Dessert Shop |
| 1 | Wakefield | 40.894705 | -73.847201 | Walgreens | 40.896528 | -73.8447 | Pharmacy |
| 2 | Wakefield | 40.894705 | -73.847201 | Carvel Ice Cream | 40.890487 | -73.848568 | Ice Cream Shop |
| 3 | Wakefield | 40.894705 | -73.847201 | Rite Aid | 40.896649 | -73.844846 | Pharmacy |
| 4 | Wakefield | 40.894705 | -73.847201 | Dunkin' | 40.890459 | -73.849089 | Donut Shop |


Before moving on, we will create a table containing all of the Mexican restaurants from our dataset.

In [26]:
# Keep only the venues categorized as Mexican restaurants
mex_rest = nyc_venues[nyc_venues['Venue Category'].str.contains('Mexican Restaurant')].reset_index(drop=True)
mex_rest.head()

| | Neighborhood | Neighborhood Latitude | Neighborhood Longitude | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|
| 0 | Kingsbridge | 40.881687 | -73.902818 | Estrellita Poblana V | 40.879687 | -73.906257 | Mexican Restaurant |
| 1 | Kingsbridge | 40.881687 | -73.902818 | Picante Picante Mexican Restaurant | 40.878252 | -73.902936 | Mexican Restaurant |
| 2 | Kingsbridge | 40.881687 | -73.902818 | Chipotle Mexican Grill | 40.884566 | -73.900474 | Mexican Restaurant |
| 3 | Norwood | 40.877224 | -73.879391 | Queen of Tacos | 40.8802 | -73.883434 | Mexican Restaurant |
| 4 | Baychester | 40.866858 | -73.835798 | Moe's Southwest Grill | 40.86631 | -73.83032 | Mexican Restaurant |


In [28]:
mex_rest.shape

(179, 7)
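
To relate this table back to the chef's questions, one possible next step is to count Mexican restaurants per neighborhood and list the neighborhoods that have none. The sketch below runs on a small hand-picked subset of the rows shown above (`sample_venues`), so its counts describe only that sample and not the full dataset. The helper name `mexican_counts` is hypothetical.

```python
import pandas as pd

# Hand-picked subset of rows from the tables above; not the full dataset
sample_venues = pd.DataFrame({
    "Neighborhood": ["Wakefield", "Wakefield", "Kingsbridge",
                     "Kingsbridge", "Norwood"],
    "Venue": ["Walgreens", "Rite Aid", "Estrellita Poblana V",
              "Chipotle Mexican Grill", "Queen of Tacos"],
    "Venue Category": ["Pharmacy", "Pharmacy", "Mexican Restaurant",
                       "Mexican Restaurant", "Mexican Restaurant"],
})


def mexican_counts(venues):
    """Count Mexican restaurants per neighborhood, zeros included."""
    is_mexican = venues["Venue Category"].str.contains(
        "Mexican Restaurant", regex=False)
    # Summing booleans per neighborhood keeps neighborhoods with no matches
    counts = is_mexican.groupby(venues["Neighborhood"]).sum().astype(int)
    return counts.sort_values(ascending=False)


counts = mexican_counts(sample_venues)
# Neighborhoods in the sample without any Mexican restaurant
without_mexican = counts[counts == 0].index.tolist()
```

Grouping the boolean mask rather than the filtered table is a deliberate choice here: filtering first would drop neighborhoods with zero matches, which are the ones the third business question asks about.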