## Introduction

In this project, we will attempt to find an area in San Francisco, CA for Chef Jimmy De Luca to move to that has similar characteristics to his current home: the beautiful Soho neighborhood in New York City. 

Once found, we will try to hunt out the closest district to him that will likely have the most demand for Italian food, since Jimmy is planning on opening a second Italian restaurant in the city. To determine this, we will look for neighborhoods in San Francisco that have both a **high number of restaurants** and a **high number of Italian restaurants**, which should be a strong indication that food (specifically, Italian food) is in demand in the area.

We will then present to Jimmy a selection of recommendations he should consider both for relocating his home and for opening an Italian restaurant. Advantages will be proved for each selection.

## Data 

The **Foursquare API** will be leveraged to help Jimmy answer who important questions: 
* What neighborhood in San Francisco should he move to that is most similar to New York City’s Soho neighborhood?
* What is the best spot to locate his next hit Italian restaurant?

To answer the first question, a dataset of all neighborhoods in San Francisco is needed, which can be found at **data.sfgov.org**. The geocoder library can then be used to get the latitude and longitude of each neighborhood. Once compiled, the Foursquare API will help perform a search of the most popular spots around each neighborhood. This will allow us to effectively compare similarities between different neighborhoods.

For the second question, the **Foursquare API** will be used once again by searching for and classifying the most popular spots in each neighborhood and determining which ones yield the highest number of restaurants. From this selected list, we can then see how many Italian restaurants exist within the concentration. We will also use a **distance formula** (using the latitude and longitude coordinates) to determine which neighborhoods are closest in proximity to Jimmy’s newfound neighborhood. All combined together, this will allow us to find the closest neighborhood to Jimmy's San Francisco home that would be the most fitting to open his Italian restaurant. 

### Grabbing the Neighborhoods and their Latitudes / Longitudes

Let's start off by importing a dataset of all San Francisco neighborhoods, which can be extracted from **data.sfgov.org**. From here, we will use **geocoder** to fetch the latitude and longitude coordinates for each respective neighborhood. 

In [1]:
import pandas as pd
from geopy.geocoders import Nominatim

san_fran_csv = pd.read_csv('San_Francisco_Analysis_Neighborhoods.csv')
san_fran_data = san_fran_csv[['NHOOD']]

for index, row in san_fran_data.itertuples():
    address = row + ", SF"

    try:
        geolocator = Nominatim(user_agent="sf_explorer")
        location = geolocator.geocode(address)
        latitude = location.latitude
        longitude = location.longitude
    except:
        latitude = 0
        longitude = 0
        
    san_fran_data.at[index, 'Latitude'] = latitude
    san_fran_data.at[index, 'Longitude'] = longitude

Add in the latitude and longitude coordinates for the neighborhoods that were not found by geolocator. 

In [2]:
inner_richmond = [37.780643, -122.472596]
san_fran_data.at[7, 'Latitude'] = inner_richmond[0]
san_fran_data.at[7, 'Longitude'] = inner_richmond[1]
    
love_mountain = [37.769621, -122.422181]
san_fran_data.at[17, 'Latitude'] = love_mountain[0]
san_fran_data.at[17, 'Longitude'] = love_mountain[1]

oceanview = [37.713651, -122.45748]
san_fran_data.at[26, 'Latitude'] = oceanview[0]
san_fran_data.at[26, 'Longitude'] = oceanview[1]

sunset = [37.760828, -122.496574]
san_fran_data.at[28, 'Latitude'] = sunset[0]
san_fran_data.at[28, 'Longitude'] = sunset[1]

west_of_twin_peaks = [37.7406, -122.4589]
san_fran_data.at[39, 'Latitude'] = west_of_twin_peaks[0]
san_fran_data.at[39, 'Longitude'] = west_of_twin_peaks[1]

san_fran_data.columns = ['Neighborhood', 'Latitude', 'Longitude']

Let's visualize the location data we have so far by plotting it to a map of San Francisco.

In [3]:
import folium

san_fran_lat = 37.7749
san_fran_long = -122.4194

# create map of Toronto using latitude and longitude values
map_san_fran = folium.Map(location=[san_fran_lat, san_fran_long], zoom_start=10)

# add markers to map
for lat, lng, neighborhood in zip(san_fran_data['Latitude'], san_fran_data['Longitude'], san_fran_data['Neighborhood']):
    label = neighborhood
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_san_fran)  
    
map_san_fran

Before delving into the Foursquare API to gather information on most popular spots within each neighborhood, let's first add in New York City's SoHo neighborhood to our dataset, which will be referenced later when performing a K-Means clustering analysis. 

In [4]:
soho = {'Neighborhood': 'SoHo', 'Latitude': 40.7233, 'Longitude': -74.0030}

san_fran_data = san_fran_data.append(soho, ignore_index=True)

### Foursquare

Now that we have all the neighborhoods required to perform our clustering analysis to determine which neighborhoods are most similar to New York City's SoHo neighborhood, we can begin by first extracting the most popular venues within each neighborhood by leveraging the **Foursquare API**.

In [5]:
CLIENT_ID = 'XUTA3MX3UVITEGZQJVP22LZJI5M0OJAYMKHYMSQVBMCVEXAC' 
CLIENT_SECRET = '3TSV5HH0S5XZROMO1IJ31LMUTNW04HDM5AA2I1DGVAWKTAU1' 
VERSION = '20180605' 
LIMIT = 100

In [6]:
# Get venues from each neighborhood and extract the category of the venue
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # Create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # Make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # Return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [7]:
# Call the getNearbyVenues() function to get venue categories for each neighborhood in San Francisco
import requests

san_fran_venues = getNearbyVenues(names=san_fran_data['Neighborhood'],
                                   latitudes=san_fran_data['Latitude'],
                                   longitudes=san_fran_data['Longitude']
                                  )

Bayview Hunters Point
Bernal Heights
Castro/Upper Market
Chinatown
Excelsior
Financial District/South Beach
Glen Park
Inner Richmond
Golden Gate Park
Haight Ashbury
Hayes Valley
Inner Sunset
Japantown
McLaren Park
Tenderloin
Lakeshore
Lincoln Park
Lone Mountain/USF
Marina
Russian Hill
Mission
Mission Bay
Nob Hill
Seacliff
Noe Valley
North Beach
Oceanview/Merced/Ingleside
South of Market
Sunset/Parkside
Outer Mission
Outer Richmond
Pacific Heights
Portola
Potrero Hill
Presidio
Presidio Heights
Treasure Island
Twin Peaks
Visitacion Valley
West of Twin Peaks
Western Addition
SoHo


In [8]:
# Take a peak at the San Francisco venues dataset created from the getNearbyVenues() function
san_fran_venues.head()

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Bayview Hunters Point,37.741286,-122.377633,Heron's Head Park,37.739663,-122.375904,Park
1,Bayview Hunters Point,37.741286,-122.377633,Bay Natives Nursery,37.740532,-122.376845,Garden Center
2,Bayview Hunters Point,37.741286,-122.377633,Speakeasy Ales & Lagers,37.738468,-122.380874,Brewery
3,Bayview Hunters Point,37.741286,-122.377633,Hunter’s Point Shoreline,37.73824,-122.376753,Waterfront
4,Bayview Hunters Point,37.741286,-122.377633,USPS Cafeteria,37.740744,-122.382309,Café


## Methodology

Now that we have successfully gathered **location data** on all neighborhoods in San Francisco and found the most **popular venues** in each respective area, we are ready to use a clustering algorithm to group similar neighborhoods together and see which ones share the most characteristics with New York City's SoHo neighborhood.

To do this, we will leverage the **K-Means clustering algorithm** due to its *simplicity* and because the data we are working with is *unlabeled*. Once the clusters are created, we will observe the neighborhoods that were grouped with New York City's SoHo neighborhood and determine which area would be the best for Jimmy to move to based on venue similarity. 

After this, we need to determine where Jimmy De Luca should open his next Italian restaurants. This can be accomplished by looking at neighborhoods nearby the area selected in the previous step and determining which of these areas has a *high number of restaurants* and a *high concentration of Italian restaurants*. By taking this approach, we will ensure that the neighborhood is a common area for restaurants to be located while also showing there is a substantial demand for authentic Italian food.  

## Analysis

Let's begin our data analysis by transforming the current dataset *san_fran_venues* into a usable form for the K-Means clustering algorithm by utilizing one hot encoding. Once accomplished, we can perform K-Means clustering on the data and plot the results on a Folium map. In addition, we can create a table for the cluster containing New York City's Soho neighborhood and show the top 10 most popular venues within each area. This will help us determine the neighborhood in San Francisco most similar to New York City's SoHo neighborhood. 

In [9]:
# One hot encoding
san_fran_onehot = pd.get_dummies(san_fran_venues[['Venue Category']], prefix="", prefix_sep="")

# Add neighborhood column back to dataframe
san_fran_onehot['Neighborhood'] = san_fran_venues['Neighborhood'] 

# Move neighborhood column to the first column
fixed_columns = [san_fran_onehot.columns[-1]] + list(san_fran_onehot.columns[:-1])
san_fran_onehot = san_fran_onehot[fixed_columns]

# Group rows by neighborhood by taking the mean of the frequency of occurence of each category
san_fran_grouped = san_fran_onehot.groupby('Neighborhood').mean().reset_index()

san_fran_grouped.head()

Unnamed: 0,Neighborhood,Yoga Studio,ATM,Accessories Store,Adult Boutique,Alternative Healer,American Restaurant,Antique Shop,Arepa Restaurant,Argentinian Restaurant,...,Veterinarian,Video Store,Vietnamese Restaurant,Waterfall,Waterfront,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store
0,Bayview Hunters Point,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.142857,0.0,0.0,0.0,0.0,0.0
1,Bernal Heights,0.028571,0.0,0.0,0.0,0.0,0.028571,0.0,0.0,0.0,...,0.0,0.0,0.028571,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Castro/Upper Market,0.020202,0.0,0.0,0.010101,0.0,0.010101,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.020202,0.010101,0.0,0.0
3,Chinatown,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.02,0.0,0.0,0.0,0.02,0.0,0.0,0.0
4,Excelsior,0.021739,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.043478,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
from sklearn.cluster import KMeans

# Set number of clusters
kclusters = 8

san_fran_grouped_clustering = san_fran_grouped.drop('Neighborhood', 1)

# Run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(san_fran_grouped_clustering)

# Check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

In [11]:
# Function to sort venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [12]:
import numpy as np

# Create a dataframe and display the top 10 venues for each neighborhood
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# Create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# Create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = san_fran_grouped['Neighborhood']

for ind in np.arange(san_fran_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(san_fran_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bayview Hunters Point,Food Truck,Garden Center,Brewery,Waterfront,Restaurant,Café,Park,Food & Drink Shop,Financial or Legal Service,Fish Market
1,Bernal Heights,Coffee Shop,Trail,Playground,Mexican Restaurant,Italian Restaurant,Bakery,Yoga Studio,Park,Pizza Place,Cocktail Bar
2,Castro/Upper Market,Gay Bar,Coffee Shop,New American Restaurant,Cosmetics Shop,Thai Restaurant,Arts & Crafts Store,Seafood Restaurant,Playground,Convenience Store,Pet Store
3,Chinatown,Coffee Shop,Cocktail Bar,Chinese Restaurant,Bakery,Italian Restaurant,New American Restaurant,Hotel,Restaurant,Men's Store,Bubble Tea Shop
4,Excelsior,Mexican Restaurant,Bakery,Pharmacy,Chinese Restaurant,Latin American Restaurant,Bank,Sandwich Place,Grocery Store,Pizza Place,Vietnamese Restaurant


In [13]:
# Add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

san_fran_merged = san_fran_data

# Merge san_fran_grouped with san_fran_data to add latitude/longitude for each neighborhood
san_fran_merged = san_fran_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

san_fran_merged.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bayview Hunters Point,37.741286,-122.377633,3,Food Truck,Garden Center,Brewery,Waterfront,Restaurant,Café,Park,Food & Drink Shop,Financial or Legal Service,Fish Market
1,Bernal Heights,37.742986,-122.415804,1,Coffee Shop,Trail,Playground,Mexican Restaurant,Italian Restaurant,Bakery,Yoga Studio,Park,Pizza Place,Cocktail Bar
2,Castro/Upper Market,37.760856,-122.434957,1,Gay Bar,Coffee Shop,New American Restaurant,Cosmetics Shop,Thai Restaurant,Arts & Crafts Store,Seafood Restaurant,Playground,Convenience Store,Pet Store
3,Chinatown,37.794301,-122.406376,1,Coffee Shop,Cocktail Bar,Chinese Restaurant,Bakery,Italian Restaurant,New American Restaurant,Hotel,Restaurant,Men's Store,Bubble Tea Shop
4,Excelsior,37.721794,-122.435382,1,Mexican Restaurant,Bakery,Pharmacy,Chinese Restaurant,Latin American Restaurant,Bank,Sandwich Place,Grocery Store,Pizza Place,Vietnamese Restaurant


In [14]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# Create map
map_clusters = folium.Map(location=[san_fran_lat, san_fran_long], zoom_start=11)

# Set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(san_fran_merged['Latitude'], san_fran_merged['Longitude'], san_fran_merged['Neighborhood'], san_fran_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=7,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

Now that we have successfully visualized the classification of neighborhoods by their respective cluster on a Folium map, let's display a table of the **first cluster**, where New York City's SoHo neighborhood has been grouped. At a glance of the map, we can see that the first cluster is the largest cluster of the eight. That being said, we will try to determine from this group which has the most common top venues in its areas as it compares to New York City's SoHo neighborhood. 

**NOTE:** Different cluster sizes were tested out to try to break down the first cluster into more granular groupings. However, even after bumping the number of clusters up to eight (and even as high as 12), the grouping stayed about the same size. 

In [15]:
soho_cluster = san_fran_merged.loc[san_fran_merged['Cluster Labels'] == 1, san_fran_merged.columns[[0] + list(range(4, san_fran_merged.shape[1]))]]

soho_cluster

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Bernal Heights,Coffee Shop,Trail,Playground,Mexican Restaurant,Italian Restaurant,Bakery,Yoga Studio,Park,Pizza Place,Cocktail Bar
2,Castro/Upper Market,Gay Bar,Coffee Shop,New American Restaurant,Cosmetics Shop,Thai Restaurant,Arts & Crafts Store,Seafood Restaurant,Playground,Convenience Store,Pet Store
3,Chinatown,Coffee Shop,Cocktail Bar,Chinese Restaurant,Bakery,Italian Restaurant,New American Restaurant,Hotel,Restaurant,Men's Store,Bubble Tea Shop
4,Excelsior,Mexican Restaurant,Bakery,Pharmacy,Chinese Restaurant,Latin American Restaurant,Bank,Sandwich Place,Grocery Store,Pizza Place,Vietnamese Restaurant
5,Financial District/South Beach,Coffee Shop,Café,Food Truck,Juice Bar,Art Gallery,Salad Place,Bar,Italian Restaurant,Sandwich Place,Gym
6,Glen Park,Coffee Shop,Café,Park,Sushi Restaurant,Baseball Field,French Restaurant,Thai Restaurant,Gym,Grocery Store,Sandwich Place
7,Inner Richmond,Chinese Restaurant,Bakery,Asian Restaurant,Grocery Store,Korean Restaurant,Japanese Curry Restaurant,Coffee Shop,Food Court,Beer Store,Marijuana Dispensary
8,Golden Gate Park,Park,Music Venue,Lake,Bus Stop,Dog Run,Harbor / Marina,Track,BBQ Joint,Sculpture Garden,Waterfall
9,Haight Ashbury,Boutique,Thrift / Vintage Store,Café,Clothing Store,Coffee Shop,Shoe Store,Thai Restaurant,Convenience Store,Bookstore,Breakfast Spot
10,Hayes Valley,Wine Bar,Clothing Store,Boutique,Optical Shop,Pizza Place,Dessert Shop,New American Restaurant,Sushi Restaurant,Ice Cream Shop,French Restaurant


In [16]:
soho_cluster[soho_cluster['1st Most Common Venue'] == 'Italian Restaurant']

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
19,Russian Hill,Italian Restaurant,Park,Coffee Shop,Playground,Pizza Place,Café,Dive Bar,Sushi Restaurant,Chocolate Shop,Rock Club
22,Nob Hill,Italian Restaurant,Hotel,Coffee Shop,Café,Bar,Wine Bar,French Restaurant,Yoga Studio,American Restaurant,Hotel Bar
41,SoHo,Italian Restaurant,Hotel,Boutique,Coffee Shop,Clothing Store,Mediterranean Restaurant,Women's Store,Men's Store,Sushi Restaurant,Dessert Shop


We are going to assume that by only looking at neighborhoods sharing the the 1st most common venue with SoHo, we can both minimize the long list of neighborhoods that make up the first cluster from the K-Means algorithm while also carefully working to find the most similar neighborhood to SoHo. 

As shown in the above table, if we take a look at all neighborhoods in San Francisco that share the 1st most common venue with New York City's SoHo, the list dwindles down to **Russian Hill** and **Nob Hill**. 

The first observation to note is the fact that both Nob Hill and SoHo have "Hotel" as the 2nd most common venue in their respective areas. 

Further comparing the three neighborhoods, we can see that SoHo has "Coffee Shop" as it's 4th most common venue, which also resides in both Russian Hill and Nob Hill, both as the 3rd most common venue. 

Last, SoHo and Nob Hill share common ground on European cuisine, with Mediterranean restaurant residing as the 6th most common venue for SoHo and French restaurant ranking as the 7th most common venue for Nob Hill. SoHo and Russian Hill also share an affiliation for Sushi restaurants, where the category is ranked number eight for Russian Hill and number nine for SoHo. 

All that being taken into account, it appears **Nob Hill** shares the most common characteristics with New York City's SoHo neighborhood. This will be the top suggestion we provide Jimmy De Luca as the place he should consider moving to. However, when calculating the neighborhoods where Jimmy should open up his Italian restaurant in the next step, we will also consider *Russian Hill* as a viable neighborhood to move to so Jimmy is presented with a variety of options.

Let's now find the top five closest neighborhoods to both Nob Hill and Russian Hill as possible contenders for locations where Jimmy could open up his Italian restaurant. Once calculated, we will answer the following questions: 
1. How many restaurants are within each of these areas?
2. How many restaurants are in fact Italian restaurants?
3. Which neighborhood has the highest number of restaurants and the highest ratio of Italian restaurants to total number of restaurants?

### Finding the top three closest neighborhoods

In [18]:
## Calculate the distance of each neighborhood to Nob Hill

from math import sin, cos, sqrt, atan2, radians

# Approximate radius of earth in km
R = 6373.0

nob_hill_lat = san_fran_data.at[22, 'Latitude']
nob_hill_long = san_fran_data.at[22, 'Longitude']

for index in range(0, len(san_fran_data.index)):
    cur_lat = san_fran_data.at[index, 'Latitude']
    cur_long = san_fran_data.at[index, 'Longitude']
    
    dlon = cur_long - nob_hill_long
    dlat = cur_lat - nob_hill_lat
    
    a = sin(dlat / 2)**2 + cos(nob_hill_lat) * cos(cur_lat) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c
    
    san_fran_data.at[index, 'Dist. to Nob Hill'] = distance
    

san_fran_data.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Dist. to Nob Hill
0,Bayview Hunters Point,37.741286,-122.377633,408.55122
1,Bernal Heights,37.742986,-122.415804,320.429024
2,Castro/Upper Market,37.760856,-122.434957,241.516606
3,Chinatown,37.794301,-122.406376,56.684316
4,Excelsior,37.721794,-122.435382,473.124883


In [19]:
## Calculate the distance of each neighborhood to Russian Hill

from math import sin, cos, sqrt, atan2, radians

# Approximate radius of earth in km
R = 6373.0

russian_hill_lat = san_fran_data.at[19, 'Latitude']
russian_hill_long = san_fran_data.at[19, 'Longitude']

for index in range(0, len(san_fran_data.index)):
    cur_lat = san_fran_data.at[index, 'Latitude']
    cur_long = san_fran_data.at[index, 'Longitude']
    
    dlon = cur_long - russian_hill_long
    dlat = cur_lat - russian_hill_lat
    
    a = sin(dlat / 2)**2 + cos(russian_hill_lat) * cos(cur_lat) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c
    
    san_fran_data.at[index, 'Dist. to Russian Hill'] = distance
    

san_fran_data.head()

Unnamed: 0,Neighborhood,Latitude,Longitude,Dist. to Nob Hill,Dist. to Russian Hill
0,Bayview Hunters Point,37.741286,-122.377633,408.55122,450.852043
1,Bernal Heights,37.742986,-122.415804,320.429024,363.90654
2,Castro/Upper Market,37.760856,-122.434957,241.516606,274.475014
3,Chinatown,37.794301,-122.406376,56.684316,77.294286
4,Excelsior,37.721794,-122.435382,473.124883,512.244753


In [20]:
# Create a dataframe for Nob Hill that contains the top 5 closest neighborhoods

nob_hill_closest = san_fran_data[['Neighborhood', 'Latitude', 'Longitude', 'Dist. to Nob Hill']].sort_values(by=['Dist. to Nob Hill']).reset_index(drop=True).head(6)
nob_hill_closest = nob_hill_closest.drop(nob_hill_closest.index[0]).reset_index(drop=True)
nob_hill_closest

Unnamed: 0,Neighborhood,Latitude,Longitude,Dist. to Nob Hill
0,Russian Hill,37.800073,-122.417094,44.954766
1,Chinatown,37.794301,-122.406376,56.684316
2,Tenderloin,37.784249,-122.413993,57.989573
3,North Beach,37.801175,-122.409002,64.130234
4,Japantown,37.785579,-122.429809,104.582943


In [21]:
# Create a dataframe for Russian Hill that contains the top 5 closest neighborhoods

russian_hill_closest = san_fran_data[['Neighborhood', 'Latitude', 'Longitude', 'Dist. to Russian Hill']].sort_values(by=['Dist. to Russian Hill']).reset_index(drop=True).head(6)
russian_hill_closest = russian_hill_closest.drop(russian_hill_closest.index[0]).reset_index(drop=True)
russian_hill_closest

Unnamed: 0,Neighborhood,Latitude,Longitude,Dist. to Russian Hill
0,Nob Hill,37.793262,-122.415249,44.954766
1,North Beach,37.801175,-122.409002,51.783916
2,Chinatown,37.794301,-122.406376,77.294286
3,Tenderloin,37.784249,-122.413993,102.745466
4,Marina,37.799793,-122.435205,114.846588


### Get number of Italian restaurants and total number of restaurants for each neighborhood

In [22]:
from pandas.io.json import json_normalize

# Category IDs for both all restaurants and italian restaurants from the Foursquare API
all_restaurants = '4d4b7105d754a06374d81259'
italian_restaurants = '4bf58dd8d48988d110941735'
radius = 175
LIMIT = 50


def count_restaurants(latitude, longitude, category):
    url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&categoryId={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, category, radius, LIMIT)

    results = requests.get(url).json()

    # Assign relevant part of JSON to venues
    venues = results['response']['venues']

    # Tranform venues into a dataframe
    dataframe = json_normalize(venues)
    
    return len(dataframe.index)

In [23]:
# Get the number of Italian restaurants, the total number of restaurants, and the frequency of Italian restaurants for Nob Hill
for index in range(0, len(nob_hill_closest.index)):
    cur_lat = nob_hill_closest.at[index, 'Latitude']
    cur_long = nob_hill_closest.at[index, 'Longitude']
    
    count_all_restaurants = count_restaurants(cur_lat, cur_long, all_restaurants)
    count_italian_restaurants = count_restaurants(cur_lat, cur_long, italian_restaurants)
    density = count_italian_restaurants / count_all_restaurants
    
    nob_hill_closest.at[index, 'Total No. of Restaurants'] = count_all_restaurants
    nob_hill_closest.at[index, 'No. of Italian Restaurants'] = count_italian_restaurants
    nob_hill_closest.at[index, 'Italian Restaurant Frequency'] = density
    
nob_hill_closest    



Unnamed: 0,Neighborhood,Latitude,Longitude,Dist. to Nob Hill,Total No. of Restaurants,No. of Italian Restaurants,Italian Restaurant Frequency
0,Russian Hill,37.800073,-122.417094,44.954766,7.0,0.0,0.0
1,Chinatown,37.794301,-122.406376,56.684316,49.0,1.0,0.020408
2,Tenderloin,37.784249,-122.413993,57.989573,38.0,0.0,0.0
3,North Beach,37.801175,-122.409002,64.130234,46.0,11.0,0.23913
4,Japantown,37.785579,-122.429809,104.582943,48.0,0.0,0.0


In [24]:
# Get the number of Italian restaurants, the total number of restaurants, and the frequency of Italian restaurants for Russian Hill
for index in range(0, len(russian_hill_closest.index)):
    cur_lat = russian_hill_closest.at[index, 'Latitude']
    cur_long = russian_hill_closest.at[index, 'Longitude']
    
    count_all_restaurants = count_restaurants(cur_lat, cur_long, all_restaurants)
    count_italian_restaurants = count_restaurants(cur_lat, cur_long, italian_restaurants)
    density = count_italian_restaurants / count_all_restaurants
    
    russian_hill_closest.at[index, 'Total No. of Restaurants'] = count_all_restaurants
    russian_hill_closest.at[index, 'No. of Italian Restaurants'] = count_italian_restaurants
    russian_hill_closest.at[index, 'Italian Restaurant Frequency'] = density
    
russian_hill_closest



Unnamed: 0,Neighborhood,Latitude,Longitude,Dist. to Russian Hill,Total No. of Restaurants,No. of Italian Restaurants,Italian Restaurant Frequency
0,Nob Hill,37.793262,-122.415249,44.954766,10.0,1.0,0.1
1,North Beach,37.801175,-122.409002,51.783916,46.0,11.0,0.23913
2,Chinatown,37.794301,-122.406376,77.294286,49.0,1.0,0.020408
3,Tenderloin,37.784249,-122.413993,102.745466,38.0,0.0,0.0
4,Marina,37.799793,-122.435205,114.846588,45.0,2.0,0.044444


## Results and Discussion

Our analysis shows that for both Russian Hill and Nob Hill, there is a clear outlier that fits the question we sought to answer: *Which neighborhood has the highest number of restaurants and the highest ratio of Italian restaurants to total number of restaurants?* 

Before looking at the results, let's summarize exactly how we were able to get to this point in our analysis. We first started out by determining which neighborhoods in San Francisco share the most similarity to New York City's SoHo neighborhood, where Italian chef Jimmy De Luca currently lives. The Foursquare API was then used to determine the top venues within each neighborhood and the K-Means clustering algorithm was then applied to group these neighborhoods by venue similarity. Once accomplished, it was determined that both *Nob Hill* and *Russian Hill* are two great contenders for neighborhoods in San Francisco Jimmy should highly consider moving to.

The next task at hand was to help Jimmy decide where he should open his next Italian restaurant. Three criteria guided us on finding the most optimal location:
1. Closeness to his new San Francisco neighborhood (either Nob Hill or Russian Hill)
2. Neighborhoods with a high concentration of restaurants (indicating a demand for food exists)
3. Neighborhoods with a high concentration of Italian restaurants (indicating a market exists for Italian food)

To address the first criterion, a distance formula was used (leveraging the latitude and longitude coordinates) to arrive at a dataframe with the top five closest neighborhoods to both Nob Hill and Russian Hill. Addressing the second and third criteria, the Foursquare API helped determine the total number of restaurants that exist within each area (using a radius of 175 meters) as well as how many Italian restaurants encompass this total. An *Italian Restaurant Frequency* calculated column was also included, allowing us to quickly determine which area has the highest concentration of Italian restaurants. 

Turning to the results, it is clear that **North Beach** is the best choice in both the Nob Hill table and the Russian Hill table. Let's look at the numbers to understand why. First observing the Nob Hill table, we can see in the *Total No. of Restaurants* column there is a lack of restaurants in Russian Hill, the closest neighborhood to Nob Hill. The remaining four neighborhoods all have a relatively high number of restaurants, but after observing both the *No. of Italian Restaurants* column and the *Italian Restaurant Frequency* column we can see that North Beach clearly has both the highest number of Italian restaurants in its area (11 to be exact) as well as the highest concentration (standing at nearly 25%). 

Similar results can be found if we now turn our attention to the Russian Hill table. Again, looking first at the *Total No. of Restaurants* column, Nob Hill clearly dwarfs the other four neighborhoods in total restaurant count and so can be quickly eliminated from the list. Now turning to the *No. of Italian Restaurants* and *Italian Restaurant Frequency* column, it is clear that **North Beach** once again is the obvious winner among the five contenders due to the fact that, as observed in the Nob Hill table, it has the highest number and greatest density of Italian restaurants. 

## Conclusion

The purpose of this project was to find an area in San Francisco, CA for Chef Jimmy De Luca to move to that has similar characteristics to his current home: the beautiful Soho neighborhood in New York City. This was accomplished with the help of the Foursquare API, used to gather the ten most popular spots around each San Francisco neighborhood and using the K-Means algorithm to cluster the neighborhoods based on venue similarity. This led us to find that *Nob Hill* and *Russian Hill* share the most common characteristics to New York City's SoHo neighborhood. These are the two neighborhoods we suggest Jimmy should consider moving to in order to feel more at home.

After we found two contenders for Jimmy's next home, we used a distance formula to calculate the closest neighborhoods to him where he could possibly open up his next hit Italian restaurant. To determine which of these areas might provide Jimmy's restaurant with the highest chance of success, we leveraged the Foursquare API to look for the nearby neighborhoods that have a *high number of restaurants* and a *high number of Italian food restaurants*, which should be a strong indication that food (specifically, Italian food) is in demand in the area. This led us to find that *North Beach* fits the description for both the aforementioned criteria. North Beach has a relatively high number of total restaurants compared to other contenders (standing at 46 restaurants within a 175 meter radius from the epicenter) and the highest concentration of Italian restaurants making up these venues (rounding to roughly 25% of total restaurants). 

We believe this data-driven approach should provide Jimmy De Luca sufficient reasoning as to why he should consider Nob Hill or Russian Hill as his next place to live and why North Beach could likely grant him success if he opens up his next Italian restaurant there. Many other factors (including crime rate and educational opportunities for his kids) were not in scope for this analysis and should be considered when Jimmy finalizes his plans to move across the country. 