# Capstone Data Science Project - with Foursquare API

## Introduction/Business Problem

Although modern vegan and vegetarian cuisine already is and becomes more popular in San Diego, CA, it seems like there is still much space for fine-dining plant-based restaurants and that some areas in San Diego County could be a good pick for embodying that idea.

The purpose of that project is to analyze chosen areas in San Diego County to find out where it could be a best choice to open upscale vegetarian/vegan restaurant oriented to serve breakfasts and brunches with cafe section, to encourage customers to not only come in for a dinner or lunch, but also to spend more time during sunny, laid back days hanging out in a high quality enviromantal friendly surroundings.

While exploring in search for a good area to start that kind of business we consider:
+ amount of restaurants nearby
+ amount of vegetarian/vegan restaurants nearby
+ median household income in the neighborhood
+ other interesting attractions around

## Data

Following resourses will be used to extract informations needed:

+ **Google Maps API geocoding** to find geolocations of points of interests 
+ **Foursquare API** for exploring neighborhoods, their venues, restaurants and attractions
+ **Median Household Income for San Diego County from the Census Bureau from datausa.io** website - csv including census geoid and median household income
+ **FCC Api** to convert geolocations to census geoid, to extract neighborhoods of interests from the Median Household Income csv


In [274]:
import requests
import pandas as pd
import numpy as np
import folium
from pandas.io.json import json_normalize
from sklearn.cluster import KMeans

import matplotlib.cm as cm
import matplotlib.colors as colors

In [124]:
GOOGLE_API_KEY=''

## Gathering data about neighborhoods of interest

We list neighborhoods which are chosen as potentiall areas for opening the above described business

In [423]:
areas = ['West F Street, Encinitas, California',
'13th St, Del Mar, California',
'Girard Avenue, Village of La Jolla, California',
'Mission Blvd, Pacific Beach, California',
'University Av, Hillcrest, San Diego, California',
'Orange Ave, Coronado, California',
'North Park Way, North Park, San Diego, California',
'Rosecrans St, Point Loma, San Diego, California',
'Plaza St, Solana Beach, California',
'300 Mission Ave, Oceanside, California',
'600 Carlsbad Village Drive, Carlsbad, California',
'600 Fifth Avenue, San Diego, California',
'1900 India Street, San Diego, California']

#### Getting informations about areas geolocation using google geocoding API

In [424]:
def get_coords(areas, neighborhoods):
    neighborhoods = pd.DataFrame([])    
    for area in areas:
        url = 'https://maps.googleapis.com/maps/api/geocode/json?key={}&address={}'.format(GOOGLE_API_KEY, area)
        response = requests.get(url).json()
        zipcode = response['results'][0]['address_components'][5]['short_name'] if len(response['results'][0]['address_components']) < 7 else response['results'][0]['address_components'][6]['short_name'] 
        neighborhoods = neighborhoods.append({
            'col1': response['results'][0]['address_components'][0]['short_name'],
            'col2': response['results'][0]['address_components'][1]['short_name'],
            'col3': response['results'][0]['address_components'][2]['short_name'],
            'col4': response['results'][0]['address_components'][3]['short_name'],
            'col5': response['results'][0]['address_components'][4]['short_name'],
            'col6': response['results'][0]['address_components'][5]['short_name'],
            'col7': zipcode,
            'lat': response['results'][0]['geometry']['location']['lat'],
            'lng': response['results'][0]['geometry']['location']['lng']
        }, ignore_index=True)
    return neighborhoods

In [426]:
df_neighborhoods = get_coords(areas, neighborhoods)

#### Cleaning dataframe from unnecessary informations and organizing gathered data

In [430]:
df_neighborhoods.drop(['col6', 'col5'], axis=1, inplace=True)

In [435]:
df_neighborhoods.iat[9, 0] = '300 Mission Ave'
df_neighborhoods.iat[9, 1] = 'Oceanside'
df_neighborhoods.iat[10, 0] = '600 Carlsbad Village Dr'
df_neighborhoods.iat[10, 1] = 'Carlsbad'
df_neighborhoods.iat[11, 0] = '600 Fifth Ave'
df_neighborhoods.iat[11, 1] = 'Gaslamp Quarter'
df_neighborhoods.iat[12, 0] = '1900 India St'
df_neighborhoods.iat[12, 1] = 'Little Italy'

In [None]:
df_neighborhoods.drop(['col3', 'col4'], axis=1, inplace=True)

In [437]:
df_neighborhoods.rename(columns={'col1': 'street', 'col2': 'neighborhood', 'col7': 'zipcode'}, inplace=True)

#### Income data
We also want to have informations about the household yearly income in the area, so we can consider that indicator when comparing neighborhoods looking to choose a place to open more expensive venue and have more of target customers around. 
We firstly get data about the census block by latitude and longitude from Federal Communications Commision API and then find the average household income by census block in the data from the Census Bureau from the datausa.io website.  


In [439]:
df_neighborhoods['geoid'] = 0
for i, hood in enumerate(df_neighborhoods.iterrows()):
    area_url = 'https://geo.fcc.gov/api/census/area?lat={}&lon={}&format=json'.format(hood[1]['lat'], hood[1]['lng'])
    results = requests.get(area_url).json()
    
    df_neighborhoods.iat[i,5] = results['results'][0]['block_fips'][:11]

In [440]:
df_income = pd.read_csv('income.csv')

In [456]:
df_neighborhoods['income'] = 0
for i, hood in enumerate(df_neighborhoods.iterrows()):
    geoid = str(hood[1]['geoid'])
    area_income = df_income[df_income['ID Geography'].str.contains(geoid)]
    income = area_income[area_income['Year'] == 2018]['Household Income by Race']
    df_neighborhoods.iat[i,6] = income

In [457]:
df_neighborhoods

Unnamed: 0,street,neigborhood,zipcode,lat,lng,geoid,income
0,W F St,Encinitas,92024,33.043216,-117.294944,6073017501,112770
1,13th St,Del Mar,92014,32.957393,-117.265067,6073017200,110625
2,Girard Ave,Village of La Jolla,92037,32.843179,-117.273384,6073008200,83878
3,Mission Blvd,Pacific Beach,92109,32.79368,-117.254593,6073007910,70074
4,University Ave,Hillcrest,US,32.748499,-117.154809,6073000600,65089
5,Orange Ave,Coronado,92118,32.690154,-117.177271,6073010900,115987
6,North Park Way,North Park,92104,32.747411,-117.127709,6073001500,76887
7,Rosecrans St,Point Loma,US,32.724732,-117.229103,6073021400,72992
8,Plaza St,Solana Beach,92075,32.991698,-117.272541,6073017303,162500
9,300 Mission Ave,Oceanside,US,33.195067,-117.381072,6073018400,48004


We now have all the informations about the areas needed for our analysis.

#### Let's take a look at the map to see where the areas are situated

In [471]:
map_sd = folium.Map(location=[df_neighborhoods.iloc[3]['lat'], df_neighborhoods.iloc[3]['lng']], zoom_start=9)

for lat, lng, neighborhood in zip(df_neighborhoods['lat'], df_neighborhoods['lng'], df_neighborhoods['neighborhood']):
    label = '{}'.format(neighborhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='green',
        fill=True,
        fill_color='#008000',
        fill_opacity=0.7,
        parse_html=False).add_to(map_sd)  
    
map_sd

## Venues data
Next step will be to gather informations about nearby venues around our points of interests. We will look in the range of 1000m.
For that purpose we use Foursquare API.

In [236]:
# credentials and consts for Foursquare API
CLIENT_ID = '' 
CLIENT_SECRET = ''
VERSION = '20180605'
LIMIT = 1000

In [239]:
#function from the course Foursquare tutorial to get venues categories types
def get_category_type(row):
    try:
        categories_list = row['categories']
    except:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

In [240]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

  This is separate from the ipykernel package so we can avoid doing imports until


In [241]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [242]:
sd_venues = getNearbyVenues(names=df_neighborhoods['neighborhood'],
                                   latitudes=df_neighborhoods['lat'],
                                   longitudes=df_neighborhoods['lng']
                                  )

Encinitas
Del Mar
Village of La Jolla
Pacific Beach
Hillcrest
Coronado
North Park
Point Loma
Solana Beach
Oceanside
Carlsbad
Gaslamp Quarter
Little Italy


In [281]:
sd_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Encinitas,33.043216,-117.294944,The Taco Stand,33.044087,-117.293727,Taco Place
1,Encinitas,33.043216,-117.294944,URBN Encinitas,33.042527,-117.293442,Pizza Place
2,Encinitas,33.043216,-117.294944,Better Buzz Coffee,33.044951,-117.293862,Coffee Shop
3,Encinitas,33.043216,-117.294944,East Village Asian Dinner,33.044256,-117.293801,Asian Restaurant
4,Encinitas,33.043216,-117.294944,Lotus Cafe & Juice Bar,33.041832,-117.293072,Vegetarian / Vegan Restaurant
...,...,...,...,...,...,...,...
827,Little Italy,32.724423,-117.168588,Portal Coffee,32.720760,-117.170815,Coffee Shop
828,Little Italy,32.724423,-117.168588,Urban Boutique Hotel,32.722696,-117.167549,Hotel
829,Little Italy,32.724423,-117.168588,Roma,32.720775,-117.167572,Grocery Store
830,Little Italy,32.724423,-117.168588,La Pensione Hotel,32.723234,-117.168444,Hotel


In [243]:
sd_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Carlsbad,100,100,100,100,100,100
Coronado,37,37,37,37,37,37
Del Mar,38,38,38,38,38,38
Encinitas,43,43,43,43,43,43
Gaslamp Quarter,100,100,100,100,100,100
Hillcrest,56,56,56,56,56,56
Little Italy,100,100,100,100,100,100
North Park,58,58,58,58,58,58
Oceanside,75,75,75,75,75,75
Pacific Beach,60,60,60,60,60,60


In [404]:
sd_venues[sd_venues['Venue Category'] == 'Coffee Shop'].groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Carlsbad,4,4,4,4,4,4
Coronado,1,1,1,1,1,1
Del Mar,1,1,1,1,1,1
Encinitas,3,3,3,3,3,3
Gaslamp Quarter,5,5,5,5,5,5
Hillcrest,4,4,4,4,4,4
Little Italy,8,8,8,8,8,8
North Park,4,4,4,4,4,4
Oceanside,4,4,4,4,4,4
Pacific Beach,3,3,3,3,3,3


In [405]:
sd_venues[sd_venues['Venue Category'] == 'Bakery'].groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Carlsbad,1,1,1,1,1,1
Encinitas,1,1,1,1,1,1
Gaslamp Quarter,1,1,1,1,1,1
North Park,1,1,1,1,1,1
Oceanside,1,1,1,1,1,1
Point Loma,1,1,1,1,1,1
Village of La Jolla,1,1,1,1,1,1


In [406]:
sd_venues[sd_venues['Venue Category'] == 'Breakfast Spot'].groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Carlsbad,3,3,3,3,3,3
Del Mar,1,1,1,1,1,1
Encinitas,2,2,2,2,2,2
Gaslamp Quarter,3,3,3,3,3,3
Little Italy,2,2,2,2,2,2
North Park,1,1,1,1,1,1
Oceanside,2,2,2,2,2,2
Pacific Beach,2,2,2,2,2,2
Point Loma,1,1,1,1,1,1
Solana Beach,2,2,2,2,2,2


In [407]:
sd_venues[sd_venues['Venue Category'] == 'Vegetarian / Vegan Restaurant'].groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Encinitas,2,2,2,2,2,2
Little Italy,1,1,1,1,1,1
North Park,1,1,1,1,1,1


In [409]:
sd_venues[sd_venues['Venue Category'] == 'Café'].groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Carlsbad,6,6,6,6,6,6
Coronado,1,1,1,1,1,1
Del Mar,1,1,1,1,1,1
Encinitas,1,1,1,1,1,1
Gaslamp Quarter,4,4,4,4,4,4
Hillcrest,1,1,1,1,1,1
Little Italy,2,2,2,2,2,2
North Park,3,3,3,3,3,3
Oceanside,2,2,2,2,2,2
Pacific Beach,4,4,4,4,4,4


In [372]:
# one hot encoding
sd_onehot = pd.get_dummies(sd_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
sd_onehot['Neighborhood'] = sd_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [sd_onehot.columns[-1]] + list(sd_onehot.columns[:-1])
sd_onehot = sd_onehot[fixed_columns]

sd_onehot.head()

Unnamed: 0,Yoga Studio,ATM,Accessories Store,Adult Boutique,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,...,Toy / Game Store,Train Station,Turkish Restaurant,Vegetarian / Vegan Restaurant,Video Store,Vietnamese Restaurant,Whisky Bar,Wine Bar,Wine Shop,Women's Store
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [373]:
sd_onehot = sd_onehot[['Neighborhood', 'American Restaurant', 'Argentinian Restaurant', 'Art Gallery',
       'Art Museum', 'Arts & Crafts Store', 'Asian Restaurant', 'Bagel Shop', 'Bakery', 'Brazilian Restaurant', 'Breakfast Spot',
       'Café', 'Cocktail Bar', 'Coffee Shop', 'French Restaurant', 'Italian Restaurant',
       'Japanese Restaurant','Latin American Restaurant', 'New American Restaurant','Park','Other Great Outdoors', 'Restaurant', 'Seafood Restaurant', 
        'South American Restaurant', 'Steakhouse', 'Sushi Restaurant', 'Tapas Restaurant', 'Thai Restaurant', 'Theater','Vegetarian / Vegan Restaurant','Whisky Bar', 'Wine Bar']]

In [374]:
sd_grouped = sd_onehot.groupby('Neighborhood').mean().reset_index()
sd_grouped

Unnamed: 0,Neighborhood,American Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Bagel Shop,Bakery,Brazilian Restaurant,...,Seafood Restaurant,South American Restaurant,Steakhouse,Sushi Restaurant,Tapas Restaurant,Thai Restaurant,Theater,Vegetarian / Vegan Restaurant,Whisky Bar,Wine Bar
0,Carlsbad,0.04,0.0,0.0,0.0,0.0,0.02,0.01,0.01,0.0,...,0.01,0.0,0.0,0.02,0.0,0.0,0.02,0.0,0.0,0.02
1,Coronado,0.027027,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.027027,0.0,0.0,0.0
2,Del Mar,0.078947,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.026316,0.0,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.052632
3,Encinitas,0.069767,0.0,0.023256,0.0,0.0,0.023256,0.0,0.023256,0.0,...,0.023256,0.023256,0.0,0.0,0.0,0.023256,0.023256,0.046512,0.0,0.023256
4,Gaslamp Quarter,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.02,...,0.02,0.0,0.06,0.03,0.01,0.0,0.01,0.0,0.01,0.01
5,Hillcrest,0.035714,0.0,0.0,0.0,0.0,0.0,0.017857,0.0,0.0,...,0.0,0.0,0.0,0.035714,0.017857,0.071429,0.0,0.0,0.0,0.0
6,Little Italy,0.02,0.01,0.0,0.0,0.01,0.0,0.0,0.0,0.0,...,0.01,0.0,0.01,0.0,0.01,0.0,0.0,0.01,0.0,0.06
7,North Park,0.017241,0.0,0.017241,0.0,0.0,0.0,0.0,0.017241,0.0,...,0.0,0.0,0.0,0.034483,0.0,0.0,0.0,0.017241,0.017241,0.017241
8,Oceanside,0.053333,0.0,0.0,0.0,0.0,0.0,0.013333,0.013333,0.0,...,0.013333,0.0,0.013333,0.013333,0.0,0.013333,0.026667,0.0,0.0,0.0
9,Pacific Beach,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.05,0.0,0.0,0.016667,0.0,0.016667,0.0,0.0,0.0,0.0


In [375]:
num_top_venues = 5

for hood in sd_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = sd_grouped[sd_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Carlsbad----
                 venue  freq
0   Italian Restaurant  0.06
1                 Café  0.06
2  American Restaurant  0.04
3          Coffee Shop  0.04
4       Breakfast Spot  0.03


----Coronado----
                 venue  freq
0                 Park  0.05
1  American Restaurant  0.03
2              Theater  0.03
3                 Café  0.03
4          Coffee Shop  0.03


----Del Mar----
                 venue  freq
0  American Restaurant  0.08
1   Italian Restaurant  0.08
2             Wine Bar  0.05
3       Breakfast Spot  0.03
4                 Café  0.03


----Encinitas----
                           venue  freq
0            American Restaurant  0.07
1                    Coffee Shop  0.07
2             Italian Restaurant  0.07
3  Vegetarian / Vegan Restaurant  0.05
4                 Breakfast Spot  0.05


----Gaslamp Quarter----
                 venue  freq
0           Steakhouse  0.06
1  American Restaurant  0.05
2          Coffee Shop  0.05
3                 Café  0.04

In [376]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [377]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = sd_grouped['Neighborhood']

for ind in np.arange(sd_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(sd_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Carlsbad,Café,Italian Restaurant,American Restaurant,Coffee Shop,Restaurant,Breakfast Spot,Asian Restaurant,Wine Bar,Theater,Sushi Restaurant
1,Coronado,Park,American Restaurant,Theater,Café,Coffee Shop,French Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store
2,Del Mar,American Restaurant,Italian Restaurant,Wine Bar,Café,Sushi Restaurant,Seafood Restaurant,Park,Latin American Restaurant,Breakfast Spot,Coffee Shop
3,Encinitas,American Restaurant,Coffee Shop,Italian Restaurant,Vegetarian / Vegan Restaurant,Breakfast Spot,New American Restaurant,Art Gallery,Asian Restaurant,Bakery,Café
4,Gaslamp Quarter,Steakhouse,American Restaurant,Coffee Shop,Café,Breakfast Spot,Sushi Restaurant,Italian Restaurant,Brazilian Restaurant,Seafood Restaurant,New American Restaurant


In [378]:
# set number of clusters
kclusters = 5

sd_grouped_clustering = sd_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(sd_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 0, 1, 1, 1, 4, 3, 0, 0, 0], dtype=int32)

In [379]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

sd_merged = df_neighborhoods

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood
sd_merged = sd_merged.merge(neighborhoods_venues_sorted.set_index('Neighborhood'), left_on='neighborhood', right_on='Neighborhood')

sd_merged.head() # check the last columns!

Unnamed: 0,street,neighborhood,zipcode,lat,lng,geoid,income,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,W F St,Encinitas,92024,33.043216,-117.294944,6073017501,112770,1,American Restaurant,Coffee Shop,Italian Restaurant,Vegetarian / Vegan Restaurant,Breakfast Spot,New American Restaurant,Art Gallery,Asian Restaurant,Bakery,Café
1,13th St,Del Mar,92014,32.957393,-117.265067,6073017200,110625,1,American Restaurant,Italian Restaurant,Wine Bar,Café,Sushi Restaurant,Seafood Restaurant,Park,Latin American Restaurant,Breakfast Spot,Coffee Shop
2,Girard Ave,Village of La Jolla,92037,32.843179,-117.273384,6073008200,83878,2,Coffee Shop,Café,Italian Restaurant,Breakfast Spot,Seafood Restaurant,Thai Restaurant,American Restaurant,Art Museum,Sushi Restaurant,Steakhouse
3,Mission Blvd,Pacific Beach,92109,32.79368,-117.254593,6073007910,70074,0,Café,American Restaurant,Coffee Shop,Seafood Restaurant,Breakfast Spot,Thai Restaurant,Sushi Restaurant,Restaurant,Bagel Shop,Brazilian Restaurant
4,University Ave,Hillcrest,US,32.748499,-117.154809,6073000600,65089,4,Coffee Shop,Thai Restaurant,Restaurant,Sushi Restaurant,American Restaurant,Tapas Restaurant,Other Great Outdoors,Bagel Shop,Italian Restaurant,Café


In [380]:
# create map
map_clusters = folium.Map(location=[df_neighborhoods.iloc[3]['lat'], df_neighborhoods.iloc[3]['lng']], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sd_merged['lat'], sd_merged['lng'], sd_merged['neighborhood'], sd_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [381]:
sd_merged.loc[sd_merged['Cluster Labels'] == 0, sd_merged.columns[[1] + list(range(5, sd_merged.shape[1]))]]


Unnamed: 0,neighborhood,geoid,income,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Pacific Beach,6073007910,70074,0,Café,American Restaurant,Coffee Shop,Seafood Restaurant,Breakfast Spot,Thai Restaurant,Sushi Restaurant,Restaurant,Bagel Shop,Brazilian Restaurant
5,Coronado,6073010900,115987,0,Park,American Restaurant,Theater,Café,Coffee Shop,French Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store
6,North Park,6073001500,76887,0,Coffee Shop,Café,Sushi Restaurant,Wine Bar,Whisky Bar,Art Gallery,Bakery,Breakfast Spot,French Restaurant,American Restaurant
8,Solana Beach,6073017303,162500,0,Café,Coffee Shop,American Restaurant,Breakfast Spot,Sushi Restaurant,Seafood Restaurant,Thai Restaurant,Park,Bakery,Brazilian Restaurant
9,Oceanside,6073018400,48004,0,American Restaurant,Coffee Shop,Theater,Breakfast Spot,Café,Restaurant,Bagel Shop,Bakery,Italian Restaurant,New American Restaurant


In [382]:
sd_merged.loc[sd_merged['Cluster Labels'] == 1, sd_merged.columns[[1] + list(range(5, sd_merged.shape[1]))]]


Unnamed: 0,neighborhood,geoid,income,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Encinitas,6073017501,112770,1,American Restaurant,Coffee Shop,Italian Restaurant,Vegetarian / Vegan Restaurant,Breakfast Spot,New American Restaurant,Art Gallery,Asian Restaurant,Bakery,Café
1,Del Mar,6073017200,110625,1,American Restaurant,Italian Restaurant,Wine Bar,Café,Sushi Restaurant,Seafood Restaurant,Park,Latin American Restaurant,Breakfast Spot,Coffee Shop
10,Carlsbad,6073017900,70500,1,Café,Italian Restaurant,American Restaurant,Coffee Shop,Restaurant,Breakfast Spot,Asian Restaurant,Wine Bar,Theater,Sushi Restaurant
11,Gaslamp Quarter,6073005300,54879,1,Steakhouse,American Restaurant,Coffee Shop,Café,Breakfast Spot,Sushi Restaurant,Italian Restaurant,Brazilian Restaurant,Seafood Restaurant,New American Restaurant


In [383]:
sd_merged.loc[sd_merged['Cluster Labels'] == 2, sd_merged.columns[[1] + list(range(5, sd_merged.shape[1]))]]


Unnamed: 0,neighborhood,geoid,income,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Village of La Jolla,6073008200,83878,2,Coffee Shop,Café,Italian Restaurant,Breakfast Spot,Seafood Restaurant,Thai Restaurant,American Restaurant,Art Museum,Sushi Restaurant,Steakhouse
7,Point Loma,6073021400,72992,2,Coffee Shop,Seafood Restaurant,Italian Restaurant,Sushi Restaurant,Wine Bar,Bakery,Breakfast Spot,New American Restaurant,Restaurant,American Restaurant


In [384]:
sd_merged.loc[sd_merged['Cluster Labels'] == 3, sd_merged.columns[[1] + list(range(5, sd_merged.shape[1]))]]


Unnamed: 0,neighborhood,geoid,income,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
12,Little Italy,6073005800,104696,3,Italian Restaurant,Coffee Shop,Wine Bar,New American Restaurant,Park,Breakfast Spot,Café,American Restaurant,Japanese Restaurant,Argentinian Restaurant


In [385]:
sd_merged.loc[sd_merged['Cluster Labels'] == 4, sd_merged.columns[[1] + list(range(5, sd_merged.shape[1]))]]


Unnamed: 0,neighborhood,geoid,income,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
4,Hillcrest,6073000600,65089,4,Coffee Shop,Thai Restaurant,Restaurant,Sushi Restaurant,American Restaurant,Tapas Restaurant,Other Great Outdoors,Bagel Shop,Italian Restaurant,Café


In [390]:
# set number of clusters
kclusters = 5

sd_grouped_clustering = sd_grouped.drop('Neighborhood', 1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(sd_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([1, 0, 1, 1, 1, 4, 3, 0, 0, 0], dtype=int32)

In [391]:
sd_grouped_clustering['income'] = df_neighborhoods['income']

In [392]:
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(sd_grouped_clustering)


In [393]:
kmeans.labels_[0:10] 

array([1, 1, 3, 4, 4, 1, 3, 4, 2, 0], dtype=int32)

In [394]:
neighborhoods_venues_sorted.drop('Cluster Labels',1,inplace=True)

In [395]:
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

sd_merged = df_neighborhoods

# merge manhattan_grouped with manhattan_data to add latitude/longitude for each neighborhood
sd_merged = sd_merged.merge(neighborhoods_venues_sorted.set_index('Neighborhood'), left_on='neighborhood', right_on='Neighborhood')

sd_merged.head() # check the last columns!

Unnamed: 0,street,neighborhood,zipcode,lat,lng,geoid,income,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,W F St,Encinitas,92024,33.043216,-117.294944,6073017501,112770,4,American Restaurant,Coffee Shop,Italian Restaurant,Vegetarian / Vegan Restaurant,Breakfast Spot,New American Restaurant,Art Gallery,Asian Restaurant,Bakery,Café
1,13th St,Del Mar,92014,32.957393,-117.265067,6073017200,110625,3,American Restaurant,Italian Restaurant,Wine Bar,Café,Sushi Restaurant,Seafood Restaurant,Park,Latin American Restaurant,Breakfast Spot,Coffee Shop
2,Girard Ave,Village of La Jolla,92037,32.843179,-117.273384,6073008200,83878,1,Coffee Shop,Café,Italian Restaurant,Breakfast Spot,Seafood Restaurant,Thai Restaurant,American Restaurant,Art Museum,Sushi Restaurant,Steakhouse
3,Mission Blvd,Pacific Beach,92109,32.79368,-117.254593,6073007910,70074,0,Café,American Restaurant,Coffee Shop,Seafood Restaurant,Breakfast Spot,Thai Restaurant,Sushi Restaurant,Restaurant,Bagel Shop,Brazilian Restaurant
4,University Ave,Hillcrest,US,32.748499,-117.154809,6073000600,65089,1,Coffee Shop,Thai Restaurant,Restaurant,Sushi Restaurant,American Restaurant,Tapas Restaurant,Other Great Outdoors,Bagel Shop,Italian Restaurant,Café


In [396]:
# create map
map_clusters = folium.Map(location=[df_neighborhoods.iloc[3]['lat'], df_neighborhoods.iloc[3]['lng']], zoom_start=10)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sd_merged['lat'], sd_merged['lng'], sd_merged['neighborhood'], sd_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

In [397]:
sd_merged.loc[sd_merged['Cluster Labels'] == 0, sd_merged.columns[[1] + list(range(5, sd_merged.shape[1]))]]


Unnamed: 0,neighborhood,geoid,income,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
3,Pacific Beach,6073007910,70074,0,Café,American Restaurant,Coffee Shop,Seafood Restaurant,Breakfast Spot,Thai Restaurant,Sushi Restaurant,Restaurant,Bagel Shop,Brazilian Restaurant
8,Solana Beach,6073017303,162500,0,Café,Coffee Shop,American Restaurant,Breakfast Spot,Sushi Restaurant,Seafood Restaurant,Thai Restaurant,Park,Bakery,Brazilian Restaurant


In [398]:
sd_merged.loc[sd_merged['Cluster Labels'] == 1, sd_merged.columns[[1] + list(range(5, sd_merged.shape[1]))]]


Unnamed: 0,neighborhood,geoid,income,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
2,Village of La Jolla,6073008200,83878,1,Coffee Shop,Café,Italian Restaurant,Breakfast Spot,Seafood Restaurant,Thai Restaurant,American Restaurant,Art Museum,Sushi Restaurant,Steakhouse
4,Hillcrest,6073000600,65089,1,Coffee Shop,Thai Restaurant,Restaurant,Sushi Restaurant,American Restaurant,Tapas Restaurant,Other Great Outdoors,Bagel Shop,Italian Restaurant,Café
5,Coronado,6073010900,115987,1,Park,American Restaurant,Theater,Café,Coffee Shop,French Restaurant,Argentinian Restaurant,Art Gallery,Art Museum,Arts & Crafts Store
10,Carlsbad,6073017900,70500,1,Café,Italian Restaurant,American Restaurant,Coffee Shop,Restaurant,Breakfast Spot,Asian Restaurant,Wine Bar,Theater,Sushi Restaurant


In [399]:
sd_merged.loc[sd_merged['Cluster Labels'] == 2, sd_merged.columns[[1] + list(range(5, sd_merged.shape[1]))]]


Unnamed: 0,neighborhood,geoid,income,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,Oceanside,6073018400,48004,2,American Restaurant,Coffee Shop,Theater,Breakfast Spot,Café,Restaurant,Bagel Shop,Bakery,Italian Restaurant,New American Restaurant


In [400]:
sd_merged.loc[sd_merged['Cluster Labels'] == 3, sd_merged.columns[[1] + list(range(5, sd_merged.shape[1]))]]


Unnamed: 0,neighborhood,geoid,income,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
1,Del Mar,6073017200,110625,3,American Restaurant,Italian Restaurant,Wine Bar,Café,Sushi Restaurant,Seafood Restaurant,Park,Latin American Restaurant,Breakfast Spot,Coffee Shop
12,Little Italy,6073005800,104696,3,Italian Restaurant,Coffee Shop,Wine Bar,New American Restaurant,Park,Breakfast Spot,Café,American Restaurant,Japanese Restaurant,Argentinian Restaurant


In [401]:
sd_merged.loc[sd_merged['Cluster Labels'] == 4, sd_merged.columns[[1] + list(range(5, sd_merged.shape[1]))]]


Unnamed: 0,neighborhood,geoid,income,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Encinitas,6073017501,112770,4,American Restaurant,Coffee Shop,Italian Restaurant,Vegetarian / Vegan Restaurant,Breakfast Spot,New American Restaurant,Art Gallery,Asian Restaurant,Bakery,Café
6,North Park,6073001500,76887,4,Coffee Shop,Café,Sushi Restaurant,Wine Bar,Whisky Bar,Art Gallery,Bakery,Breakfast Spot,French Restaurant,American Restaurant
7,Point Loma,6073021400,72992,4,Coffee Shop,Seafood Restaurant,Italian Restaurant,Sushi Restaurant,Wine Bar,Bakery,Breakfast Spot,New American Restaurant,Restaurant,American Restaurant
11,Gaslamp Quarter,6073005300,54879,4,Steakhouse,American Restaurant,Coffee Shop,Café,Breakfast Spot,Sushi Restaurant,Italian Restaurant,Brazilian Restaurant,Seafood Restaurant,New American Restaurant


In [361]:
# get bakeries, vegan restaurants and breakfast spots there are
# analyze each neighborhood - cluster - income?
# get how many reatuarants in each candidate neighborhood in general
# get income for each candidate - add to clustering / cluster by income
