<h1>Coursera Data Science Capstone</h1>

## Table of contents
* [Introduction: Business Problem](#introduction)
* [Data](#data)
* [Methodology](#methodology)
* [Analysis](#analysis)
* [Results and Discussion](#results)
* [Conclusion](#conclusion)

## Introduction: Business Problem <a name="introduction"></a>

<p>An investor wants to open three businesses in Livermore, CA, an up-and-coming town, and wants to model them on how the Marina neighborhood of San Francisco, CA has developed. To determine what to open, we treat each San Francisco neighborhood as its own entity and find its most common businesses. We then compare the Marina neighborhood's top businesses with Livermore's current top businesses to identify the gap.</p>

## Data <a name="data"></a>

Based on the definition of our problem, the factors that will influence our decision are:
* Top 10 Businesses in the Marina Neighborhood of San Francisco
* Top 10 Businesses in Livermore, CA

The following data sources will be needed to extract or generate the required information:
* the center of each neighborhood will be approximated by the latitude and longitude of its ZIP code, obtained using **pgeocode**
* the number of businesses in every neighborhood, with their type and location, will be obtained using the **Foursquare API**


## Methodology <a name="methodology"></a>

In this project we direct our efforts at Livermore, CA and the Marina District of San Francisco, CA. We limit the analysis to an area of roughly 500 m around each center.

In the first step we collect the required **data: location and type (category) of every business within 500 m of the center of each San Francisco neighborhood**.

In the second step we collect the required **data: location and type (category) of every business within 500 m of the center of Livermore**.

In the third and final step we create **clusters of San Francisco neighborhoods** and, as agreed with the stakeholders, use the **Marina District** as the model for comparison with Livermore.

## Analysis <a name="analysis"></a>

<p>copyright Bijan Shokrollahi</p>

<p>Imports</p>

In [4]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

<p>Scraping the web for San Francisco neighborhoods and their respective ZIP codes</p>

In [6]:
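from io import StringIO

# download the page and collect every <table> element
response = requests.get("http://www.healthysf.org/bdi/outcomes/zipmap.htm")
soup = BeautifulSoup(response.text, "lxml")
table = soup.find_all("table")
# wrap the HTML in StringIO: recent pandas versions deprecate passing literal HTML strings
df = pd.read_html(StringIO(str(table)))
# the fifth parsed table holds the ZIP code / neighborhood listing
df = df[4]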

In [7]:
df

Unnamed: 0,0,1,2
0,Zip Code,Neighborhood,Population (Census 2000)
1,94102,Hayes Valley/Tenderloin/North of Market,28991
2,94103,South of Market,23016
3,94107,Potrero Hill,17368
4,94108,Chinatown,13716
5,94109,Polk/Russian Hill (Nob Hill),56322
6,94110,Inner Mission/Bernal Heights,74633
7,94112,Ingelside-Excelsior/Crocker-Amazon,73104
8,94114,Castro/Noe Valley,30574
9,94115,Western Addition/Japantown,33115


<p>Using pgeocode to determine longitude and latitude of zip codes</p>

In [14]:
import pgeocode
nomi = pgeocode.Nominatim('US')
pac_heights = nomi.query_postal_code("94115")

In [10]:
new_header = df.iloc[0] #grab the first row for the header
df = df[1:] #take the data less the header row
df.columns = new_header #set the header row as the df header
df

Unnamed: 0,Zip Code,Neighborhood,Population (Census 2000)
1,94102,Hayes Valley/Tenderloin/North of Market,28991
2,94103,South of Market,23016
3,94107,Potrero Hill,17368
4,94108,Chinatown,13716
5,94109,Polk/Russian Hill (Nob Hill),56322
6,94110,Inner Mission/Bernal Heights,74633
7,94112,Ingelside-Excelsior/Crocker-Amazon,73104
8,94114,Castro/Noe Valley,30574
9,94115,Western Addition/Japantown,33115
10,94116,Parkside/Forest Hill,42958


In [12]:
df = df[:-1]
df

Unnamed: 0,Zip Code,Neighborhood,Population (Census 2000)
1,94102,Hayes Valley/Tenderloin/North of Market,28991
2,94103,South of Market,23016
3,94107,Potrero Hill,17368
4,94108,Chinatown,13716
5,94109,Polk/Russian Hill (Nob Hill),56322
6,94110,Inner Mission/Bernal Heights,74633
7,94112,Ingelside-Excelsior/Crocker-Amazon,73104
8,94114,Castro/Noe Valley,30574
9,94115,Western Addition/Japantown,33115
10,94116,Parkside/Forest Hill,42958


In [13]:
df = df.drop("Population (Census 2000)", axis=1)
df

Unnamed: 0,Zip Code,Neighborhood
1,94102,Hayes Valley/Tenderloin/North of Market
2,94103,South of Market
3,94107,Potrero Hill
4,94108,Chinatown
5,94109,Polk/Russian Hill (Nob Hill)
6,94110,Inner Mission/Bernal Heights
7,94112,Ingelside-Excelsior/Crocker-Amazon
8,94114,Castro/Noe Valley
9,94115,Western Addition/Japantown
10,94116,Parkside/Forest Hill


<p>(The DataFrame is converted to a list of dicts because iterating over plain records and updating them in place is faster than updating the DataFrame row by row, where each update can create a copy.)
    <br>
    Getting the latitude and longitude of each neighborhood
</p>

In [17]:
df_dict = df.to_dict("records")
for item in df_dict:
    result = nomi.query_postal_code(item.get("Zip Code"))
    item["latitude"] = result.latitude
    item["longitude"] = result.longitude

In [18]:
df = pd.DataFrame(df_dict)
df

Unnamed: 0,Zip Code,Neighborhood,latitude,longitude
0,94102,Hayes Valley/Tenderloin/North of Market,37.7813,-122.4167
1,94103,South of Market,37.7725,-122.4147
2,94107,Potrero Hill,37.7621,-122.3971
3,94108,Chinatown,37.7929,-122.4079
4,94109,Polk/Russian Hill (Nob Hill),37.7917,-122.4186
5,94110,Inner Mission/Bernal Heights,37.7509,-122.4153
6,94112,Ingelside-Excelsior/Crocker-Amazon,37.7195,-122.4411
7,94114,Castro/Noe Valley,37.7587,-122.433
8,94115,Western Addition/Japantown,37.7856,-122.4358
9,94116,Parkside/Forest Hill,37.7441,-122.4863
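
The row-wise enrichment above follows a general pattern: turn the table into a list of records, look up each key, and rebuild the table. A minimal sketch of that pattern in plain Python, assuming a hypothetical `lookup_coordinates` stub in place of `nomi.query_postal_code` (the two coordinates are copied from the table above; the helper names are illustrative only):

```python
# Hypothetical stand-in for nomi.query_postal_code; values copied from the table above.
KNOWN_COORDINATES = {
    "94102": (37.7813, -122.4167),
    "94103": (37.7725, -122.4147),
}


def lookup_coordinates(zip_code):
    """Return (latitude, longitude) for a ZIP code, or (None, None) if unknown."""
    return KNOWN_COORDINATES.get(str(zip_code), (None, None))


def add_coordinates(records):
    """Return new records with 'latitude' and 'longitude' keys added."""
    enriched = []
    for record in records:
        latitude, longitude = lookup_coordinates(record["Zip Code"])
        # build a new dict so the input records are left untouched
        enriched.append({**record, "latitude": latitude, "longitude": longitude})
    return enriched


records = [
    {"Zip Code": "94102", "Neighborhood": "Hayes Valley/Tenderloin/North of Market"},
    {"Zip Code": "94103", "Neighborhood": "South of Market"},
]
enriched_records = add_coordinates(records)
```

Returning new dictionaries instead of mutating the input makes the step safe to re-run in a notebook, and unknown ZIP codes surface as `None` rather than raising.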


<p>Visualize each neighborhood using Folium</p>

In [24]:
import folium
map_sf = folium.Map(location=[37.7792808,-122.4192363],zoom_start=12)

for lat,long,neighbourhood in zip(df['latitude'],df['longitude'],df['Neighborhood']):
    label = '{}'.format(neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat,long],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_sf)
map_sf

In [86]:
CLIENT_ID = 'client_id' # your Foursquare ID
CLIENT_SECRET = 'client_secret' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
radius = 500 # define radius

In [None]:
neighborhood_latitude = df.loc[0, 'latitude'] # neighborhood latitude value
neighborhood_longitude = df.loc[0, 'longitude'] # neighborhood longitude value

neighborhood_name = df.loc[0, 'Neighborhood'] # neighborhood name
# create the API request URL for the first neighborhood
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    neighborhood_latitude, 
    neighborhood_longitude, 
    radius, 
    LIMIT)
results = requests.get(url).json()
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
venues = results['response']['groups'][0]['items']
    
nearby_venues = pd.json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()

In [27]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
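
The request URL above is assembled with `str.format`, which leaves query-string escaping and credential handling to the caller. A hedged alternative sketch using only the standard library follows. The helper names `build_explore_url` and `first_category_name` and the environment variable names are assumptions for illustration; they are not part of the original notebook or of any Foursquare client.

```python
import os
from urllib.parse import urlencode

EXPLORE_ENDPOINT = "https://api.foursquare.com/v2/venues/explore"


def build_explore_url(lat, lng, client_id, client_secret,
                      version="20180605", radius=500, limit=100):
    """Build the explore URL with properly encoded query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": "{},{}".format(lat, lng),
        "radius": radius,
        "limit": limit,
    }
    # safe="," keeps the comma in "lat,lng" readable instead of %2C
    return EXPLORE_ENDPOINT + "?" + urlencode(params, safe=",")


def first_category_name(venue):
    """Return the first category name, or None when the venue has no categories."""
    categories = venue.get("categories") or []
    return categories[0]["name"] if categories else None


# Assumed variable names; reading secrets from the environment keeps them out of the notebook.
client_id = os.environ.get("FOURSQUARE_CLIENT_ID", "client_id")
client_secret = os.environ.get("FOURSQUARE_CLIENT_SECRET", "client_secret")
example_url = build_explore_url(37.7813, -122.4167, client_id, client_secret)
```

`first_category_name` guards against venues with an empty category list, for which `v['venue']['categories'][0]['name']` would raise an `IndexError`.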

<p>Get the nearby venues for each neighborhood in San Francisco</p>

In [28]:
sf_venues = getNearbyVenues(names=df['Neighborhood'],
                                   latitudes=df['latitude'],
                                   longitudes=df['longitude']
                                  )
sf_venues

Hayes Valley/Tenderloin/North of Market
South of Market
Potrero Hill
Chinatown
Polk/Russian Hill (Nob Hill)
Inner Mission/Bernal Heights
Ingelside-Excelsior/Crocker-Amazon
Castro/Noe Valley
Western Addition/Japantown
Parkside/Forest Hill
Haight-Ashbury
Inner Richmond
Outer Richmond
Sunset
Marina
Bayview-Hunters Point
St. Francis Wood/Miraloma/West Portal
Twin Peaks-Glen Park
Lake Merced
North Beach/Chinatown
Visitacion Valley/Sunnydale


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Hayes Valley/Tenderloin/North of Market,37.7813,-122.4167,Asian Art Museum,37.780178,-122.416505,Art Museum
1,Hayes Valley/Tenderloin/North of Market,37.7813,-122.4167,Philz Coffee,37.781266,-122.416901,Coffee Shop
2,Hayes Valley/Tenderloin/North of Market,37.7813,-122.4167,Saigon Sandwich,37.783084,-122.417650,Sandwich Place
3,Hayes Valley/Tenderloin/North of Market,37.7813,-122.4167,Ales Unlimited: Beer Basement,37.782751,-122.415656,Beer Bar
4,Hayes Valley/Tenderloin/North of Market,37.7813,-122.4167,Golden Era Vegan,37.781495,-122.416822,Vegetarian / Vegan Restaurant
...,...,...,...,...,...,...,...
1163,Visitacion Valley/Sunnydale,37.7190,-122.4096,Visitacion Valley Greenway,37.717687,-122.407316,Garden
1164,Visitacion Valley/Sunnydale,37.7190,-122.4096,Wilde Overlook,37.718066,-122.412379,Scenic Lookout
1165,Visitacion Valley/Sunnydale,37.7190,-122.4096,Philosopher's Way,37.718133,-122.412717,Trail
1166,Visitacion Valley/Sunnydale,37.7190,-122.4096,Dwight Club,37.721513,-122.411461,Performing Arts Venue
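
Because the analysis depends on venues lying within 500 m of each center, it can be useful to sanity-check the returned coordinates. A minimal sketch using the haversine formula, assuming a spherical Earth with a mean radius of 6,371 km; `haversine_m` is a hypothetical helper, and the coordinates are copied from the first row of the table above:

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Neighborhood center vs. Asian Art Museum (first row of the table above)
distance = haversine_m(37.7813, -122.4167, 37.780178, -122.416505)
within_radius = distance <= 500
```

Applied to every row, a check like this would flag venues that fall outside the requested radius.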


In [29]:
print('There are {} unique categories.'.format(len(sf_venues['Venue Category'].unique())))

There are 238 unique categories.


<p>Use one-hot encoding to encode the categorical values</p>

In [30]:
# one hot encoding
sf_onehot = pd.get_dummies(sf_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
sf_onehot['Neighborhood'] = sf_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [sf_onehot.columns[-1]] + list(sf_onehot.columns[:-1])
sf_onehot = sf_onehot[fixed_columns]

sf_onehot.head()

Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,...,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yemeni Restaurant,Yoga Studio
0,Hayes Valley/Tenderloin/North of Market,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1,Hayes Valley/Tenderloin/North of Market,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Hayes Valley/Tenderloin/North of Market,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Hayes Valley/Tenderloin/North of Market,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Hayes Valley/Tenderloin/North of Market,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [33]:
sf_onehot.shape

(1168, 239)

<p>Get the average of each category grouped by neighborhood</p>

In [34]:
sf_grouped = sf_onehot.groupby('Neighborhood').mean().reset_index()
sf_grouped

Unnamed: 0,Neighborhood,ATM,Accessories Store,Adult Boutique,African Restaurant,American Restaurant,Antique Shop,Argentinian Restaurant,Art Gallery,Art Museum,...,Vegetarian / Vegan Restaurant,Vietnamese Restaurant,Warehouse Store,Whisky Bar,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yemeni Restaurant,Yoga Studio
0,Bayview-Hunters Point,0.0,0.0,0.0,0.0625,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Castro/Noe Valley,0.0,0.0,0.012658,0.0,0.012658,0.0,0.0,0.012658,0.0,...,0.0,0.0,0.0,0.0,0.025316,0.012658,0.0,0.0,0.0,0.025316
2,Chinatown,0.0,0.0,0.0,0.0,0.010526,0.0,0.0,0.0,0.0,...,0.021053,0.021053,0.0,0.010526,0.010526,0.0,0.0,0.0,0.0,0.010526
3,Haight-Ashbury,0.0,0.026316,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.026316,0.0,0.0,0.0,0.052632,0.0,0.0,0.0,0.0,0.026316
4,Hayes Valley/Tenderloin/North of Market,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.013333,...,0.04,0.066667,0.0,0.0,0.013333,0.0,0.0,0.0,0.013333,0.0
5,Ingelside-Excelsior/Crocker-Amazon,0.0,0.0,0.0,0.0,0.023256,0.0,0.0,0.0,0.0,...,0.0,0.046512,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,Inner Mission/Bernal Heights,0.0,0.0,0.0,0.0,0.013514,0.0,0.0,0.040541,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,Inner Richmond,0.013699,0.0,0.0,0.0,0.0,0.0,0.0,0.013699,0.0,...,0.0,0.027397,0.0,0.0,0.013699,0.027397,0.0,0.0,0.0,0.0
8,Lake Merced,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,Marina,0.0,0.02,0.0,0.0,0.02,0.0,0.0,0.0,0.0,...,0.01,0.01,0.0,0.0,0.04,0.01,0.0,0.01,0.0,0.02


In [35]:
sf_grouped.shape

(21, 239)
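
The mean of a 0/1 column within a group is simply the share of that group's venues that fall in the category. A value such as 0.0625 would correspond, for example, to 1 venue out of 16. The sketch below reproduces the same quantity in plain Python on made-up toy data, without one-hot encoding; `category_frequencies` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter, defaultdict


def category_frequencies(venues):
    """Map each neighborhood to {category: share of that neighborhood's venues}."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    return {
        neighborhood: {cat: n / sum(counter.values()) for cat, n in counter.items()}
        for neighborhood, counter in counts.items()
    }


# Made-up toy data, not taken from the Foursquare results above
toy_venues = [
    ("A", "Coffee Shop"),
    ("A", "Coffee Shop"),
    ("A", "Bakery"),
    ("A", "Park"),
    ("B", "Bakery"),
]
frequencies = category_frequencies(toy_venues)
```

One consequence is that neighborhoods with few venues get coarse frequencies, so a single venue can dominate their ranking.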

<p>Print top 5 venues of each neighborhood</p>

In [36]:
num_top_venues = 5

for hood in sf_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = sf_grouped[sf_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Bayview-Hunters Point----
                             venue  freq
0  Southern / Soul Food Restaurant  0.19
1               Mexican Restaurant  0.12
2                           Bakery  0.12
3                             Café  0.06
4           Thrift / Vintage Store  0.06


----Castro/Noe Valley----
             venue  freq
0          Gay Bar  0.09
1  Thai Restaurant  0.05
2      Coffee Shop  0.05
3      Yoga Studio  0.03
4            Plaza  0.03


----Chinatown----
                venue  freq
0              Bakery  0.06
1               Hotel  0.05
2         Coffee Shop  0.05
3  Chinese Restaurant  0.04
4  Dim Sum Restaurant  0.03


----Haight-Ashbury----
            venue  freq
0     Coffee Shop  0.11
1  Ice Cream Shop  0.05
2            Park  0.05
3        Wine Bar  0.05
4          Bakery  0.05


----Hayes Valley/Tenderloin/North of Market----
                           venue  freq
0          Vietnamese Restaurant  0.07
1                          Hotel  0.05
2                 Sand

In [37]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]
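
Many categories share the same frequency, and `sort_values` with its default settings does not promise a particular order among ties. This is likely why the top-5 printout and the top-10 table list tied categories, such as Bakery and Mexican Restaurant in Bayview-Hunters Point, in different orders. A minimal sketch of a ranking with an explicit, deterministic tie-break (alphabetical) follows; `top_categories` is a hypothetical helper, and the frequencies are the rounded values from the printout above.

```python
def top_categories(frequencies, n):
    """Return the n most frequent categories; ties are broken alphabetically."""
    ranked = sorted(frequencies.items(), key=lambda item: (-item[1], item[0]))
    return [category for category, _ in ranked[:n]]


# Rounded frequencies copied from the Bayview-Hunters Point printout above
bayview = {
    "Southern / Soul Food Restaurant": 0.19,
    "Mexican Restaurant": 0.12,
    "Bakery": 0.12,
    "Café": 0.06,
    "Thrift / Vintage Store": 0.06,
}
top_three = top_categories(bayview, 3)
```

The alphabetical rule is arbitrary; what matters is that the same input always produces the same ranking.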

<p>Convert the encoded values to a ranking grouped by neighborhood and return a DataFrame</p>

In [39]:
import numpy as np
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = sf_grouped['Neighborhood']

for ind in np.arange(sf_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(sf_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Bayview-Hunters Point,Southern / Soul Food Restaurant,Bakery,Mexican Restaurant,Pharmacy,Thrift / Vintage Store,Park,BBQ Joint,Gym,Theater,Café
1,Castro/Noe Valley,Gay Bar,Thai Restaurant,Coffee Shop,Arts & Crafts Store,Clothing Store,Convenience Store,Public Art,Deli / Bodega,Plaza,New American Restaurant
2,Chinatown,Bakery,Hotel,Coffee Shop,Chinese Restaurant,Dim Sum Restaurant,Tea Room,Bubble Tea Shop,Szechuan Restaurant,Pizza Place,Men's Store
3,Haight-Ashbury,Coffee Shop,Ice Cream Shop,Wine Bar,Bakery,Park,Comic Shop,Breakfast Spot,Bubble Tea Shop,Souvlaki Shop,Burrito Place
4,Hayes Valley/Tenderloin/North of Market,Vietnamese Restaurant,Hotel,Sandwich Place,Thai Restaurant,Vegetarian / Vegan Restaurant,Theater,Bakery,Café,Beer Bar,Concert Hall
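
The column labels above come from indexing into `['st', 'nd', 'rd']` and falling back to `th`. That is correct for ranks 1 through 20, but it would produce labels such as "21th" if `num_top_venues` were raised beyond that. A small sketch of a general ordinal helper (the name `ordinal` is an assumption for illustration):

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"  # 10th through 20th, 110th through 120th, and so on
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return "{}{}".format(n, suffix)


num_top_venues = 10
columns = ["Neighborhood"] + [
    "{} Most Common Venue".format(ordinal(rank))
    for rank in range(1, num_top_venues + 1)
]
```

For `num_top_venues = 10` this yields the same column names as the cell above.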


<p>Let's cluster the neighborhoods into 5 clusters based on similarity</p>

In [40]:
from sklearn.cluster import KMeans
# set number of clusters
kclusters = 5

sf_grouped_clustering = sf_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(sf_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([3, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
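
k-means groups rows by repeatedly assigning each one to its nearest centroid (by Euclidean distance) and then recomputing the centroids. A minimal sketch of the assignment step in plain Python, using made-up two-category frequency vectors rather than the data above; `assign_to_nearest` is a hypothetical helper and not scikit-learn's implementation:

```python
import math


def assign_to_nearest(points, centroids):
    """Return, for each point, the index of the closest centroid."""
    labels = []
    for point in points:
        distances = [math.dist(point, centroid) for centroid in centroids]
        labels.append(distances.index(min(distances)))
    return labels


# Made-up vectors: (share of restaurants, share of parks)
points = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9)]
centroids = [(0.85, 0.15), (0.1, 0.9)]
labels = assign_to_nearest(points, centroids)
```

The label numbers are arbitrary identifiers: cluster 3 is not "larger" than cluster 1. Fixing `random_state=0` in the cell above keeps the labels reproducible across runs.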

In [41]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

sf_merged = df

# join the sorted venue rankings onto df to add latitude/longitude for each neighborhood
sf_merged = sf_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')

sf_merged.head() # check the last columns!

Unnamed: 0,Zip Code,Neighborhood,latitude,longitude,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,94102,Hayes Valley/Tenderloin/North of Market,37.7813,-122.4167,1,Vietnamese Restaurant,Hotel,Sandwich Place,Thai Restaurant,Vegetarian / Vegan Restaurant,Theater,Bakery,Café,Beer Bar,Concert Hall
1,94103,South of Market,37.7725,-122.4147,1,Coffee Shop,Food Truck,Nightclub,Cocktail Bar,Gay Bar,Sushi Restaurant,Motorcycle Shop,Restaurant,Clothing Store,Thai Restaurant
2,94107,Potrero Hill,37.7621,-122.3971,1,Café,Brewery,Wine Shop,Breakfast Spot,Coffee Shop,Gift Shop,Mexican Restaurant,Grocery Store,Farmers Market,Motorcycle Shop
3,94108,Chinatown,37.7929,-122.4079,1,Bakery,Hotel,Coffee Shop,Chinese Restaurant,Dim Sum Restaurant,Tea Room,Bubble Tea Shop,Szechuan Restaurant,Pizza Place,Men's Store
4,94109,Polk/Russian Hill (Nob Hill),37.7917,-122.4186,1,Bar,Grocery Store,Italian Restaurant,Massage Studio,Yoga Studio,Wine Bar,Vietnamese Restaurant,Café,Bakery,Deli / Bodega


<p>Display each neighborhood color coded by their cluster on the map using Folium</p>

In [46]:
import matplotlib.cm as cm
import matplotlib.colors as colors
# create map
map_clusters = folium.Map(location=[37.7792808,-122.4192363],zoom_start=12)

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(sf_merged['latitude'], sf_merged['longitude'], sf_merged['Neighborhood'], sf_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster-1],
        fill=True,
        fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters

<p>Let's take a look at each cluster</p>

In [47]:
sf_merged.loc[sf_merged['Cluster Labels'] == 0, sf_merged.columns[[1] + list(range(5, sf_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
9,Parkside/Forest Hill,Chinese Restaurant,Light Rail Station,Dumpling Restaurant,Café,Korean Restaurant,Liquor Store,Sandwich Place,Sushi Restaurant,Bar,Thai Restaurant
12,Outer Richmond,Pizza Place,Sushi Restaurant,Café,Chinese Restaurant,Pharmacy,Bar,Tennis Court,Shanghai Restaurant,Grocery Store,Creperie
13,Sunset,Chinese Restaurant,Bank,Bakery,Dim Sum Restaurant,Train Station,Bubble Tea Shop,Salon / Barbershop,Liquor Store,Supermarket,Sushi Restaurant


In [48]:
sf_merged.loc[sf_merged['Cluster Labels'] == 1, sf_merged.columns[[1] + list(range(5, sf_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Hayes Valley/Tenderloin/North of Market,Vietnamese Restaurant,Hotel,Sandwich Place,Thai Restaurant,Vegetarian / Vegan Restaurant,Theater,Bakery,Café,Beer Bar,Concert Hall
1,South of Market,Coffee Shop,Food Truck,Nightclub,Cocktail Bar,Gay Bar,Sushi Restaurant,Motorcycle Shop,Restaurant,Clothing Store,Thai Restaurant
2,Potrero Hill,Café,Brewery,Wine Shop,Breakfast Spot,Coffee Shop,Gift Shop,Mexican Restaurant,Grocery Store,Farmers Market,Motorcycle Shop
3,Chinatown,Bakery,Hotel,Coffee Shop,Chinese Restaurant,Dim Sum Restaurant,Tea Room,Bubble Tea Shop,Szechuan Restaurant,Pizza Place,Men's Store
4,Polk/Russian Hill (Nob Hill),Bar,Grocery Store,Italian Restaurant,Massage Studio,Yoga Studio,Wine Bar,Vietnamese Restaurant,Café,Bakery,Deli / Bodega
5,Inner Mission/Bernal Heights,Mexican Restaurant,Grocery Store,Coffee Shop,Art Gallery,Fish Market,Deli / Bodega,Gym / Fitness Center,Bookstore,Pizza Place,Latin American Restaurant
6,Ingelside-Excelsior/Crocker-Amazon,Pizza Place,Mexican Restaurant,Sandwich Place,Latin American Restaurant,Vietnamese Restaurant,Bar,Bus Station,Café,Burrito Place,Coffee Shop
7,Castro/Noe Valley,Gay Bar,Thai Restaurant,Coffee Shop,Arts & Crafts Store,Clothing Store,Convenience Store,Public Art,Deli / Bodega,Plaza,New American Restaurant
8,Western Addition/Japantown,Bakery,Tea Room,Grocery Store,Gift Shop,Pizza Place,Cosmetics Shop,Ice Cream Shop,Boutique,Spa,Sandwich Place
10,Haight-Ashbury,Coffee Shop,Ice Cream Shop,Wine Bar,Bakery,Park,Comic Shop,Breakfast Spot,Bubble Tea Shop,Souvlaki Shop,Burrito Place


In [49]:
sf_merged.loc[sf_merged['Cluster Labels'] == 2, sf_merged.columns[[1] + list(range(5, sf_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
16,St. Francis Wood/Miraloma/West Portal,Trail,Monument / Landmark,Bus Line,Park,Tree,Yoga Studio,Food,Fish Market,Flower Shop,Food Court


In [50]:
sf_merged.loc[sf_merged['Cluster Labels'] == 3, sf_merged.columns[[1] + list(range(5, sf_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
15,Bayview-Hunters Point,Southern / Soul Food Restaurant,Bakery,Mexican Restaurant,Pharmacy,Thrift / Vintage Store,Park,BBQ Joint,Gym,Theater,Café


In [51]:
sf_merged.loc[sf_merged['Cluster Labels'] == 4, sf_merged.columns[[1] + list(range(5, sf_merged.shape[1]))]]

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
20,Visitacion Valley/Sunnydale,Trail,Scenic Lookout,Garden,Park,Performing Arts Venue,Music Venue,Electronics Store,Donut Shop,Dog Run,French Restaurant


<p>Create the Livermore variables</p>

In [53]:
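# NOTE: the names are transposed (livermore_long holds the latitude, livermore_lat the longitude).
# The request below passes them in that same order, so the API call is correct, but the
# 'Neighborhood Latitude' / 'Neighborhood Longitude' output columns end up swapped.
livermore_long = 37.681160
livermore_lat = -121.773380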
livermore = "Livermore"
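
Latitude and longitude are easy to transpose: in the cell above, 37.681160 (a latitude) is stored under the name `livermore_long`. A simple range check catches this kind of mix-up whenever the longitude's magnitude exceeds 90, as it does for Livermore. The helper name `validate_lat_lng` is an assumption for illustration.

```python
def validate_lat_lng(latitude, longitude):
    """Return (latitude, longitude) or raise ValueError if either is out of range."""
    if not -90 <= latitude <= 90:
        raise ValueError("latitude {} is outside [-90, 90]".format(latitude))
    if not -180 <= longitude <= 180:
        raise ValueError("longitude {} is outside [-180, 180]".format(longitude))
    return latitude, longitude


# Livermore: 37.681160 is the latitude, -121.773380 the longitude
livermore_coordinates = validate_lat_lng(37.681160, -121.773380)

try:
    validate_lat_lng(-121.773380, 37.681160)  # transposed on purpose
    transposed_detected = False
except ValueError:
    transposed_detected = True
```

The check cannot detect a swap when both values lie within [-90, 90], so consistent naming remains the primary safeguard.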

<h3>Get the Livermore Nearby Venues</h3>

In [65]:
venues_list=[]
# create the API request URL
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    str(livermore_long), 
    str(livermore_lat), 
    500, 
    LIMIT)

# make the GET request
results = requests.get(url).json()["response"]['groups'][0]['items']

# return only relevant information for each nearby venue
venues_list.append([(
    livermore, 
    livermore_lat, 
    livermore_long, 
    v['venue']['name'], 
    v['venue']['location']['lat'], 
    v['venue']['location']['lng'],  
    v['venue']['categories'][0]['name']) for v in results])

nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
nearby_venues.columns = ['Neighborhood', 
          'Neighborhood Latitude', 
          'Neighborhood Longitude', 
          'Venue', 
          'Venue Latitude', 
          'Venue Longitude', 
          'Venue Category']

nearby_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,,-121.77338,37.68116,Vine Cinema & Alehouse,37.680147,-121.774872,Indie Movie Theater
1,,-121.77338,37.68116,Story Coffee Co.,37.679797,-121.771807,Café
2,,-121.77338,37.68116,Beach Hut Deli,37.680162,-121.774367,Sandwich Place
3,,-121.77338,37.68116,First Street Alehouse,37.681475,-121.770040,Pub
4,,-121.77338,37.68116,Donut Wheel,37.680896,-121.770819,Donut Shop
...,...,...,...,...,...,...,...
60,,-121.77338,37.68116,Tri-Valley Haven Thrift Store,37.682862,-121.772044,Thrift / Vintage Store
61,,-121.77338,37.68116,Buenas Vidas Thrift Store,37.683056,-121.772049,Thrift / Vintage Store
62,,-121.77338,37.68116,Lewis Grocery,37.684141,-121.772599,Food & Drink Shop
63,,-121.77338,37.68116,Victorine Olive Oil,37.683181,-121.769640,Restaurant


<h3>Fill in neighborhood as 'Livermore' since we are going to treat it as one neighborhood</h3>

In [66]:
livermore_venues = nearby_venues
livermore_venues.loc[:,'Neighborhood'] = 'Livermore'
livermore_venues

Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Livermore,-121.77338,37.68116,Vine Cinema & Alehouse,37.680147,-121.774872,Indie Movie Theater
1,Livermore,-121.77338,37.68116,Story Coffee Co.,37.679797,-121.771807,Café
2,Livermore,-121.77338,37.68116,Beach Hut Deli,37.680162,-121.774367,Sandwich Place
3,Livermore,-121.77338,37.68116,First Street Alehouse,37.681475,-121.770040,Pub
4,Livermore,-121.77338,37.68116,Donut Wheel,37.680896,-121.770819,Donut Shop
...,...,...,...,...,...,...,...
60,Livermore,-121.77338,37.68116,Tri-Valley Haven Thrift Store,37.682862,-121.772044,Thrift / Vintage Store
61,Livermore,-121.77338,37.68116,Buenas Vidas Thrift Store,37.683056,-121.772049,Thrift / Vintage Store
62,Livermore,-121.77338,37.68116,Lewis Grocery,37.684141,-121.772599,Food & Drink Shop
63,Livermore,-121.77338,37.68116,Victorine Olive Oil,37.683181,-121.769640,Restaurant


In [67]:
print('There are {} unique categories.'.format(len(livermore_venues['Venue Category'].unique())))

There are 43 unique categories.


<h3>Encode the categorical values</h3>

In [68]:
# one hot encoding
livermore_onehot = pd.get_dummies(livermore_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
livermore_onehot['Neighborhood'] = livermore_venues['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [livermore_onehot.columns[-1]] + list(livermore_onehot.columns[:-1])
livermore_onehot = livermore_onehot[fixed_columns]

livermore_onehot.head()

Unnamed: 0,Neighborhood,ATM,Afghan Restaurant,American Restaurant,Bakery,Bar,Beer Bar,Bookstore,Bowling Alley,Burger Joint,...,Restaurant,Sandwich Place,Smoothie Shop,Sporting Goods Shop,Sushi Restaurant,Thrift / Vintage Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Winery
0,Livermore,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Livermore,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Livermore,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,Livermore,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Livermore,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [70]:
livermore_onehot.shape

(65, 44)

In [71]:
livermore_grouped = livermore_onehot.groupby('Neighborhood').mean().reset_index()
livermore_grouped

Unnamed: 0,Neighborhood,ATM,Afghan Restaurant,American Restaurant,Bakery,Bar,Beer Bar,Bookstore,Bowling Alley,Burger Joint,...,Restaurant,Sandwich Place,Smoothie Shop,Sporting Goods Shop,Sushi Restaurant,Thrift / Vintage Store,Vietnamese Restaurant,Wine Bar,Wine Shop,Winery
0,Livermore,0.015385,0.015385,0.015385,0.030769,0.030769,0.015385,0.015385,0.015385,0.015385,...,0.046154,0.030769,0.015385,0.015385,0.030769,0.030769,0.015385,0.030769,0.015385,0.015385


In [72]:
num_top_venues = 5

for hood in livermore_grouped['Neighborhood']:
    print("----"+hood+"----")
    temp = livermore_grouped[livermore_grouped['Neighborhood'] == hood].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print('\n')

----Livermore----
                venue  freq
0  Mexican Restaurant  0.08
1         Coffee Shop  0.08
2          Restaurant  0.05
3      Discount Store  0.03
4            Pharmacy  0.03




In [73]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = livermore_grouped['Neighborhood']

for ind in np.arange(livermore_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(livermore_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,Livermore,Mexican Restaurant,Coffee Shop,Restaurant,Pub,Ice Cream Shop,Discount Store,Bar,Pharmacy,Farmers Market,Sandwich Place


In [75]:
# set number of clusters
kclusters = 1

livermore_grouped_clustering = livermore_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(livermore_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 

array([0], dtype=int32)

In [84]:
marina = sf_merged.loc[sf_merged['Neighborhood'] == "Marina", sf_merged.columns[[1] + list(range(5, sf_merged.shape[1]))]]
marina

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Marina,Italian Restaurant,Cosmetics Shop,Gym / Fitness Center,Wine Bar,Spa,Salad Place,French Restaurant,Motel,Steakhouse,Burger Joint


<p>Combine the Marina and Livermore DataFrames to view their top 10 venue categories side by side and draw the conclusion</p>

In [85]:
merged = pd.concat([marina, neighborhoods_venues_sorted])
merged

Unnamed: 0,Neighborhood,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
14,Marina,Italian Restaurant,Cosmetics Shop,Gym / Fitness Center,Wine Bar,Spa,Salad Place,French Restaurant,Motel,Steakhouse,Burger Joint
0,Livermore,Mexican Restaurant,Coffee Shop,Restaurant,Pub,Ice Cream Shop,Discount Store,Bar,Pharmacy,Farmers Market,Sandwich Place
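
The side-by-side table can also be reduced to an explicit gap list: the Marina categories, in rank order, that are absent from Livermore's top 10. A minimal sketch with both lists copied from the table above; `category_gap` is a hypothetical helper.

```python
# Top-10 lists copied from the combined table above
marina_top = [
    "Italian Restaurant", "Cosmetics Shop", "Gym / Fitness Center", "Wine Bar",
    "Spa", "Salad Place", "French Restaurant", "Motel", "Steakhouse", "Burger Joint",
]
livermore_top = [
    "Mexican Restaurant", "Coffee Shop", "Restaurant", "Pub", "Ice Cream Shop",
    "Discount Store", "Bar", "Pharmacy", "Farmers Market", "Sandwich Place",
]


def category_gap(model, target):
    """Categories in `model` that are missing from `target`, in model rank order."""
    present = set(target)
    return [category for category in model if category not in present]


gap = category_gap(marina_top, livermore_top)
recommendations = gap[:3]
```

This compares against Livermore's top 10 only. Some Marina categories, such as Wine Bar and Burger Joint, do appear among Livermore's one-hot columns, so a stricter gap would compare against every category found in Livermore rather than just the top 10.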


## Results and Discussion <a name="results"></a>


<h2>Conclusion <a name="conclusion"></a></h2>
<p>
   Since the investor wants to model their investment decisions on the Marina neighborhood of San Francisco, and none of Marina's three most common venue categories appear among Livermore's current top 10, we recommend opening the following three types of business:
</p>
<ol>
    <li>Italian Restaurant</li>
    <li>Cosmetics Shop</li>
    <li>Gym / Fitness Center</li>
</ol>