# VENUE PROFILE OF BOOKSTORES IN SAN FRANCISCO 

**by We You Toh**

## INTRODUCTION

In my casual daily observation, I find that where there is a bookstore, I am usually able to find a coffee shop nearby pretty easily. Putting this into the perspective of business intelligence, it could be a useful proposition to find out whether it is a common phenomenon that bookstores and coffee shops are close to each other. Since my focus will be placed on bookstores, I also plan to test out if it is common that bookstores are near each other. These are the objectives of this project.

In this project, we will use the Foursquare API to build a *"venue profile"* for the bookstores in the city of San Francisco. With the venue information gathered and with the help of the Folium library, we will place markers on the map of San Francisco to visualize the bookstore locations. We will then go further and compare San Francisco's bookstore profile against New York City's. Finally, we will run a few statistical tests to check whether the differences between the two venue profiles are significant.

>*__Note:__ This notebook is publicly shared in a GitHub repository. The Folium interactive maps and some Markdown features don't display on GitHub the same way as they do on a local host. To interact with these features, it may be necessary to download the Jupyter notebook and host it locally. If you wish, you may re-run the whole notebook on your own. You may also wish to provide your own Foursquare API credentials.*
>
> *__Tip:__ If you run the notebook on a local host, set the notebook to 'Trusted' to enable JavaScript display; otherwise the Folium maps may not display properly.*

<div class="alert alert-block alert-info" style="margin-top: 0px"></div>

## TABLE OF CONTENTS

1. [**ANALYSIS APPROACH**](#item1) 
2. [**DATA COLLECTION**](#item2)  
3. [**ANALYSIS OF DATASET**](#item3)  
4. [**MORE DATA COLLECTION**](#item4)  
5. [**COMPARISON OF BOOKSTORES IN SAN FRANCISCO AND NEW YORK**](#item5)
6. [**CONCLUSION**](#item6)


<div class="alert alert-block alert-info" style="margin-top: 0px"></div>

Before we get the data and start exploring it, let's import all the dependencies that we will need.

In [1]:
# For web-scraping
import requests # library to handle Foursquare API requests

# For data-handling
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON data format
import html # library to handle html data format

# For geographical-related tasks
# !conda install -c conda-forge haversine --yes # uncomment this line if haversine has not been installed.
from haversine import haversine # Calculate distance between two locations, given their latitude and longitude values.

# !conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if folium has not been installed.
import folium # map rendering library
from folium.plugins import HeatMap

# Statistical Computations and Tests
from statsmodels.stats.proportion import proportion_confint
from statsmodels.stats.proportion import proportions_ztest

print('Libraries imported.')

Libraries imported.


<a id='item1'></a>

## ANALYSIS APPROACH

### The Business Problem

As mentioned earlier, I've chosen to work on the questions of whether it is a common phenomenon that 
1. bookstores are near one another.  
2. bookstores and coffee shops are close to one another
  

The quantifiable way of using data to look at the questions will be to find out the proportion of bookstores near coffee shops, and the proportion of bookstores near one another.

This analysis will follow a descriptive approach to provide the information we want.

### Data Requirement

To achieve the analysis objectives, I will be gathering the following data:
1. Venue/location data. 
    - This will come from Foursquare and the data collected will be stored in a dataframe.
2. Proximity data. 
    - This feature will be engineered using the venue/location data. Each venue will be checked to determine separately if other bookstores, or coffee shops, are nearby. The results will be stored in separate columns, which will then be appended to the dataframe.


For consistency in the data collection, the following quantifiable specifications are determined:
1. How near is "near"? 
    - We'll quantify this as a 250m radius (roughly 1 to 2 city blocks).
2. What is the specific set of latitude and longitude to use? 
    - We will rely on Foursquare's defined location for San Francisco city.
3. How far should the search cover? 
    - To ensure consistency in the search process, we'll set the coverage to a 4000m radius, which should cover San Francisco city substantially.
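
The 250m notion of "near" can be made concrete with a great-circle distance check. The sketch below is a minimal, standard-library stand-in for the `haversine` package imported at the top of this notebook, not the code the analysis actually runs. The names `haversine_km` and `within_radius`, and the mean Earth radius of 6371.0088 km, are assumptions made for illustration; the coordinates are taken from the bookstore results shown in this notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_km(point1, point2):
    """Great-circle distance in km between two (lat, lng) pairs given in degrees."""
    lat1, lng1 = map(math.radians, point1)
    lat2, lng2 = map(math.radians, point2)
    dlat = lat2 - lat1
    dlng = lng2 - lng1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlng / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(point1, point2, radius_km=0.25):
    """True when the two points are no more than `radius_km` apart."""
    return haversine_km(point1, point2) <= radius_km


# Coordinates as returned by Foursquare for four bookstores
dog_eared = (37.758446, -122.421366)
borderlands = (37.759207, -122.421510)
books_inc = (37.781614, -122.420531)
city_lights = (37.797708, -122.406489)

print(within_radius(dog_eared, borderlands))   # a pair less than 100m apart
print(within_radius(books_inc, city_lights))   # a pair well over 1km apart
```

The first pair falls inside the 250m radius and the second does not, which is the yes/no signal the proximity columns are built from.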


The steps taken in the analysis can be summarized as follows:
1. Data gathering/storing
    - Gather venue data from Foursquare and store in a dataframe.
2. Data preparation
    - Carry out logical tests if there are other bookstores or coffee shops nearby for each bookstore, and append the results to the dataframe.
3. Data analysis 
    - Compute the statistics: Calculate the proportion of bookstores near each other/coffee shops.
    - Data visualization: Place the bookstores and coffee shops on a map to provide a visual sense of their closeness to one another.

Similarly, we repeat the steps to gather another dataset for New York City.

#### Statistical Test

In the final step, we'll carry out a few tests to compare the results between San Francisco and New York. Since we are working with proportions, and we'll be sampling two sets of data, it is appropriate to carry out a two-sample proportions Z-test on the datasets.
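
For reference, the test can be written out as follows. This is the standard pooled form of the two-sample proportions Z-test; the symbols $c_i$ (number of bookstores flagged as "near") and $n_i$ (number of bookstores sampled) for city $i$ are notation chosen here for illustration.

$$
\hat{p}_i = \frac{c_i}{n_i}, \qquad
\hat{p} = \frac{c_1 + c_2}{n_1 + n_2}, \qquad
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}
$$

Under the null hypothesis that both cities share the same underlying proportion, $z$ is approximately standard normal, so the two-sided p-value is $2\,\bigl(1 - \Phi(|z|)\bigr)$, where $\Phi$ is the standard normal cumulative distribution function.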


<a id='item2'></a>

## DATA COLLECTION

#### Foursquare API User Authentication

To use Foursquare to get the venue information we need, we require three values for authentication in an API call: Client ID, Client Secret and Version.

A Python dict named 'creds' is used to store the Client ID and Client Secret. It contains two keys: 'id' and 'secret'. To avoid misuse of these credentials, the contents of 'creds' have been hidden. Reviewers of this project are encouraged to use their own Foursquare credentials to follow along with this project. Foursquare credentials can be obtained through https://developer.foursquare.com.

'20181201' is the version used in this project.
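
These three values travel as query-string parameters on every call. As a rough illustration, the sketch below assembles an explore URL with `urllib.parse.urlencode`, which percent-encodes values such as `'San Francisco, CA'`. It is not the helper used in this notebook: `build_explore_url`, `EXPLORE_ENDPOINT` and `demo_creds` are hypothetical names, and the credentials are placeholders.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(city, query, credentials, version, offset=0, limit=50, radius=None):
    """Hypothetical helper: assemble an explore URL with percent-encoded parameters."""
    params = {
        'near': city,
        'query': query,
        'client_id': credentials['id'],
        'client_secret': credentials['secret'],
        'v': version,
        'offset': offset,
        'limit': limit,
    }
    # Only send a radius when one is given
    if radius is not None:
        params['radius'] = radius
    return EXPLORE_ENDPOINT + '?' + urlencode(params)


# Placeholder credentials, not real ones
demo_creds = {'id': 'YOUR_CLIENT_ID', 'secret': 'YOUR_CLIENT_SECRET'}
url = build_explore_url('San Francisco, CA', 'bookstore', demo_creds, '20181201', radius=4000)
print(url)

# Round-trip check: decoding the query string recovers the original city name
print(parse_qs(urlsplit(url).query)['near'])
```

Building the query string from a dict keeps the parameter names in one place and avoids hand-escaping spaces and commas in the city name.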

In [2]:
# The code was removed by Watson Studio for sharing.

Storing Foursquare credentials in a python dict: dict_keys(['id', 'secret'])
Foursquare version used in this project: '20181201'


#### Helper Functions

These intermediary functions are defined to support the `get4SquareVenues` function.

In [3]:
def getCount(results):
    # Input is from the query response, in JSON data format. 
    print('\nQuery on `{}` yields {} result(s) found'.format(results['query'], results['totalResults']))
    return(results['totalResults'])

def createDatarow(item):
    """
    Extraction of item data in JSON format. Returns a formatted datarow in tuple format.
    """
    array = (
        item['venue']['name'],
        item['venue']['id'],
        item['venue']['location']['lat'], 
        item['venue']['location']['lng'],
        item['venue']['categories'][0]['name']
    )
    return(array)

def getResponse(city, radius, query_type, query_items, credentials, version, i=0, limit=50):
    """
    Arguments:
      city: string
      radius: float or int
      query_type: string (either `query` or `categoryId`)
      query_items: list of strings
      credentials: dict of 2 keys - `id`, `secret`
      version: string
      i: int (used to offset the results returned)
      limit: int
    """
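    client_id = credentials['id']  # avoid shadowing the built-in `id`
    secret = credentials['secret']

    # API query call
    url = 'https://api.foursquare.com/v2/venues/explore?'\
            'near={}&{}={}&client_id={}&client_secret={}&v={}&offset={}&limit={}'\
            .format(city, query_type, ','.join(query_items), client_id, secret, version, i*limit, limit)

    # Append the radius only when one is given; otherwise keep the URL unchanged.
    if radius is not None:
        url = url + '&radius={}'.format(radius)

#     print(url)

    results = None
    try:
        results = requests.get(url).json()
        return(results['response'])
    except (requests.RequestException, ValueError, KeyError) as err:
        print('Error in getResponse: {}'.format(err))
        # Return Foursquare's `meta` block if a response was parsed, else None.
        return(results.get('meta') if isinstance(results, dict) else None)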



`get4SquareVenues` is defined below. Calls are made through this function to retrieve the list of Foursquare venues.

In [4]:
def get4SquareVenues(city, query_type, query_items, credentials, radius=2000, limit=50, version='20181201'):
    """
    Foursquare results are returned from this function.
    Foursquare limits the number of results returned in each API call,
    so the call is repeated with an increasing offset (in steps of `limit`)
    to retrieve the results beyond the first call.
    
    Arguments:
        See notes in function getResponse(*args)
    """
    
    first_search = getResponse(city, radius, \
                               query_type, query_items, \
                               credentials, version) # Use default values of `i` and `limit`.
    counts = getCount(first_search)
    
#     return(None)

    # Compile list of venues
    venues_list = []
    for i in range(0, int(np.ceil(counts/limit))):
        # Query offsets are achieved with the variable 'i'.
        results = getResponse(city, radius, query_type, query_items, credentials, version, i, limit)
        results = results['groups'][0]['items']
        for item in results:
            temp = createDatarow(item)
            venues_list.append(temp)

    if not venues_list:
        print('No results collected.')
        return(None)    
    else:
        venues = pd.DataFrame(venues_list)       
        venues.columns = ['Name','Venue ID','Lat','Lng', 'Category']

        print('{} result(s) collected'.format(len(venues)))

        return(venues)
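
The paging logic inside `get4SquareVenues` can be illustrated in isolation. The sketch below lists the offsets that would be requested for a given total; `page_offsets` is a hypothetical helper written for this illustration, not part of the notebook's code.

```python
import math


def page_offsets(total_results, limit=50):
    """Hypothetical helper: offsets needed to page through `total_results` items."""
    n_pages = math.ceil(total_results / limit)
    return [page * limit for page in range(n_pages)]


print(page_offsets(131))  # [0, 50, 100]
print(page_offsets(203))  # [0, 50, 100, 150, 200]
```

A query reporting 131 results therefore needs three calls, with the last call returning the remaining 31 venues.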

#### Data Collection

In [5]:
# Set query parameters
city = 'San Francisco, CA'
coverage = 4000 # in meters

In [6]:
# Gather the venues matching the bookstore query
# Use default values of `limit` and `version`

query = ['bookstore']
df_bookstore = get4SquareVenues(city, 'query', query, creds, radius = coverage) 

df_bookstore.head()


Query on `bookstore` yields 131 result(s) found
131 result(s) collected


| | Name | Venue ID | Lat | Lng | Category |
|---|---|---|---|---|---|
| 0 | Books, Inc. | 4bc77b370050b71392c7b83b | 37.781614 | -122.420531 | Bookstore |
| 1 | Aardvark Books | 4a467a5df964a520eca81fe3 | 37.767131 | -122.428906 | Bookstore |
| 2 | Dog Eared Books | 49e79762f964a520d0641fe3 | 37.758446 | -122.421366 | Bookstore |
| 3 | Dog Eared Books | 574f605b498ee997a05fa2d7 | 37.761206 | -122.434959 | Bookstore |
| 4 | City Lights Bookstore | 49bc2d06f964a5201d541fe3 | 37.797708 | -122.406489 | Bookstore |


In [7]:
# Gather the venues matching the coffee shop query
# Use default values of `limit` and `version`

query = ['coffee shop']
df_coffee = get4SquareVenues(city, 'query', query, creds, radius = coverage) 

df_coffee.head()


Query on `coffee shop` yields 183 result(s) found
183 result(s) collected


| | Name | Venue ID | Lat | Lng | Category |
|---|---|---|---|---|---|
| 0 | Blue Bottle Coffee | 5560dbdb498e91a2bcde84f6 | 37.776286 | -122.416867 | Coffee Shop |
| 1 | Blue Bottle Coffee | 43d3901ef964a5201f2e1fe3 | 37.776407 | -122.423251 | Coffee Shop |
| 2 | Arlequin Cafe & Food To Go | 49eb8289f964a520ea661fe3 | 37.777059 | -122.422683 | Café |
| 3 | Ritual Coffee Roasters | 4dd94350e4cd37c893d8146f | 37.776476 | -122.424281 | Coffee Shop |
| 4 | 20th Century Cafe | 51d9ade3498eeedbc8f06c0a | 37.774903 | -122.422434 | Café |


Note that there is some variability in the `Category` column.

For the purpose of this project, we shall assume that the results returned from Foursquare are accurate and consistent. In this case, 'Coffee Shop' and 'Café' are treated as equivalent, and the data collected do not need additional cleaning/scrubbing.
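
If the equivalence ever needed to be made explicit, for example before grouping venues by category, a small alias table would be one way to do it. The sketch below is only an illustration of that idea on a few rows from the output above; `CATEGORY_ALIASES` and `normalize_category` are hypothetical names, and this step is not applied anywhere in this notebook.

```python
# Assumption: these two Foursquare category labels are treated as one group.
CATEGORY_ALIASES = {'Café': 'Coffee Shop'}


def normalize_category(name):
    """Map an alias to its canonical label; leave other labels untouched."""
    return CATEGORY_ALIASES.get(name, name)


rows = [
    ('Blue Bottle Coffee', 'Coffee Shop'),
    ('Arlequin Cafe & Food To Go', 'Café'),
    ('20th Century Cafe', 'Café'),
]
normalized = [(venue, normalize_category(category)) for venue, category in rows]
print(sorted({category for _, category in normalized}))  # ['Coffee Shop']
```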

<a id='item3'></a>

## ANALYSIS OF DATASET

#### Helper Functions

These functions are related to gathering information about a venue within its proximity.

In [8]:
def is_near(venue1,venue2,distance):    
    # requires haversine library
    # from haversine import haversine
    """ 
    inputs:
    venue1, venue2 -- a tuple/list of latitude, longitude, e.g. [37.77493, -122.41942] 
    distance -- a float, units in kilometers 
    """
    return(haversine(venue1,venue2) <= distance)

def count_is_near(venue1, list_venues2, distance):
    # Count of the number of venues within distance from venue1
    tt = [is_near(venue1, venue2, distance) for venue2 in list_venues2]
    return(sum(tt))

In [9]:
# Set parameter
proximity = 0.25 # in kilometers. Use this to check for venues within 250m of the target, roughly a radius of 1 to 2 city blocks.

### Proximity Analysis

In [10]:
lat_lng_bookstores = df_bookstore[['Lat','Lng']].apply(lambda x: tuple(x), axis = 1)

proximity_check = lat_lng_bookstores.apply(count_is_near, args=(lat_lng_bookstores.tolist(),proximity)) \
                                    .apply(lambda x: (x-1) != 0) # x - 1 to discount own location

# Replace column, if it already exists
label = 'Near Same Kind'
df_bookstore.drop(columns =[label], inplace=True) \
     if label in df_bookstore.columns \
     else None
df_bookstore = df_bookstore.join(proximity_check.rename(label))

print('Proportions of bookstores near the other bookstores:  %.3f' % proximity_check.mean())


ll_coffee = df_coffee[['Lat','Lng']].apply(lambda x: tuple(x), axis = 1)

proximity_check = ll_coffee.apply(count_is_near, args=(ll_coffee.tolist(),proximity)) \
                                    .apply(lambda x: (x-1) != 0)

print('Proportions of coffee shops near same kind:  %.3f' % proximity_check.mean())

Proportions of bookstores near the other bookstores:  0.687
Proportions of coffee shops near same kind:  0.710


In [11]:
proximity_check = lat_lng_bookstores.apply(count_is_near, args=(ll_coffee.tolist(),proximity)) \
                                    .apply(lambda x: x != 0)

# Replace column, if it already exists
label = 'Near Coffee Shops'
df_bookstore.drop(columns =[label], inplace=True) \
     if label in df_bookstore.columns \
     else None
df_bookstore = df_bookstore.join(proximity_check.rename(label))

print('Proportions of bookstores near coffee shops:  %.3f' % proximity_check.mean())


Proportions of bookstores near coffee shops:  0.710


A quick check back on the dataframe.

In [12]:
df_bookstore

| | Name | Venue ID | Lat | Lng | Category | Near Same Kind | Near Coffee Shops |
|---|---|---|---|---|---|---|---|
| 0 | Books, Inc. | 4bc77b370050b71392c7b83b | 37.781614 | -122.420531 | Bookstore | True | True |
| 1 | Aardvark Books | 4a467a5df964a520eca81fe3 | 37.767131 | -122.428906 | Bookstore | True | True |
| 2 | Dog Eared Books | 49e79762f964a520d0641fe3 | 37.758446 | -122.421366 | Bookstore | True | True |
| 3 | Dog Eared Books | 574f605b498ee997a05fa2d7 | 37.761206 | -122.434959 | Bookstore | True | True |
| 4 | City Lights Bookstore | 49bc2d06f964a5201d541fe3 | 37.797708 | -122.406489 | Bookstore | True | True |
| 5 | The Green Arcade | 4a934850f964a520701f20e3 | 37.773162 | -122.421903 | Bookstore | False | True |
| 6 | Alexander Book Company | 4ac29d66f964a520fe9920e3 | 37.788671 | -122.400626 | Bookstore | True | True |
| 7 | Browser Books | 49f5ebdcf964a520d46b1fe3 | 37.789766 | -122.434111 | Bookstore | False | True |
| 8 | Borderlands Books | 4aa32a35f964a520904320e3 | 37.759207 | -122.42151 | Bookstore | True | True |
| 9 | The Booksmith | 42a78680f964a52017251fe3 | 37.769821 | -122.449363 | Bookstore | True | True |


### Visual Analysis

We'll be using the Folium library to create a couple of maps to help us visualize the dataset.

#### Map Center

In [13]:
# An initial query to get Foursquare's latitude and longitude values for San Francisco. This helps to center the map.
query = 'https://api.foursquare.com/v2/venues/explore?'\
        'near={}'\
        '&client_id={}&client_secret={}&v={}'\
        .format(city, creds['id'],creds['secret'],VERSION)
map_center = requests.get(query).json()['response']['geocode']['center']
print(map_center)

{'lat': 37.77493, 'lng': -122.41942}


#### Heat Map of Bookstores in San Francisco

In [14]:
venue_latlng = np.stack((df_bookstore['Lat'],df_bookstore['Lng']),axis = 1).tolist()
heat_map = folium.Map(location=[map_center['lat'], map_center['lng']], zoom_start=13, tiles='Cartodb Positron')
HeatMap(venue_latlng).add_to(heat_map)
heat_map

This heat map indicates to me that the bookstores are concentrated in certain areas within the city.

#### Marked Locations of Bookstores and Coffee Shops in San Francisco

In [15]:
map_city = folium.Map(location=[map_center['lat'], map_center['lng']], zoom_start=14, tiles='Cartodb Positron')

# Function to create circle markers on Folium map
def add_marker(datarow, marker_color, map, showLabel = False):
    if showLabel:
        label = '''
                <b>{}</b><br>
                Near other bookstores: {}<br>  
                Near coffee shops: {}<br>
                '''.format(html.escape(datarow['Name']), 
                           datarow['Near Same Kind'], 
                           datarow['Near Coffee Shops']
                          )
        label = folium.Popup(label)
    folium.Circle(
        (datarow['Lat'],datarow['Lng']),
        radius = 5, # in meters
        popup=label if showLabel else None,
        color = marker_color,
        fill = True,
        fill_color = marker_color,
        fill_opacity = 0.7).add_to(map)
    
    return

# Create map with location markers
df_bookstore.apply(add_marker, axis=1, args=('blue',map_city, True))
df_coffee.apply(add_marker, axis=1, args=('red',map_city))
map_city

This map gives me a visual indication that it is indeed not uncommon to find bookstores near coffee shops.

In [16]:
# Computing the confidence interval of the population proportion value
# Proportion of bookstores near other bookstores

count = df_bookstore['Near Same Kind'].sum()
nobs =  df_bookstore['Near Same Kind'].count()
proportion_confint(count, nobs, alpha=0.05, method='normal')

(0.6076167158751714, 0.7664290856515462)

In [17]:
# Computing the confidence interval of the population proportion value
# Proportion of bookstores near coffee shops

count = df_bookstore['Near Coffee Shops'].sum()
nobs =  df_bookstore['Near Coffee Shops'].count()
proportion_confint(count, nobs, alpha=0.05, method='normal')

(0.6322141094583753, 0.7876332187858994)
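
To make the interval above less of a black box, the sketch below recomputes a normal-approximation (Wald) interval by hand, which should correspond to what `method='normal'` does. `normal_approx_confint` is a hypothetical name used for this illustration, and the input of 90 "near" bookstores out of 131 is the count behind the 0.687 proportion found earlier.

```python
import math
from statistics import NormalDist


def normal_approx_confint(count, nobs, alpha=0.05):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / nobs)."""
    p_hat = count / nobs
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    half_width = z_crit * math.sqrt(p_hat * (1 - p_hat) / nobs)
    return (p_hat - half_width, p_hat + half_width)


low, high = normal_approx_confint(90, 131)
print(round(low, 3), round(high, 3))
```

The bounds agree with the `proportion_confint` output for bookstores near other bookstores to the displayed precision.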

<a id='item4'></a>

## MORE DATA COLLECTION

Next, I will be comparing San Francisco with another city. I have chosen New York because the number of bookstores and coffee shops in New York recorded in the Foursquare database will be sufficient for the analysis.

#### Helper Function

`dataset_by_city` is essentially a wrapper function that assembles the steps taken in the analysis we've just done above. Creating this function allows me to update the `city` parameter and store the results conveniently.

In [18]:
def dataset_by_city(city, creds, radius=4000, proximity_distance=0.25):
    df_bookstore = get4SquareVenues(city, 'query', ['bookstore'], creds, radius)
    df_coffee = get4SquareVenues(city, 'query', ['coffee shop'], creds, radius)  # use the `radius` argument, not the global `coverage`

    ll_bookstores = df_bookstore[['Lat','Lng']].apply(lambda x: tuple(x), axis = 1)
    ll_coffee = df_coffee[['Lat','Lng']].apply(lambda x: tuple(x), axis = 1)

    # Storing proximity data
    proximity_check = ll_bookstores.apply(count_is_near, args=(ll_bookstores.tolist(),proximity_distance)) \
                                   .apply(lambda x: (x-1) !=0) # x - 1 to discount own location
    label = 'Near Same Kind'
    df_bookstore = df_bookstore.join(proximity_check.rename(label))

    print('\nProportion of bookstores near the other bookstores: %.3f' % proximity_check.mean())

    # Storing proximity data
    proximity_check = ll_bookstores.apply(count_is_near, args=(ll_coffee.tolist(),proximity_distance)) \
                                   .apply(lambda x: x !=0)

    label = 'Near Coffee Shops'
    df_bookstore = df_bookstore.join(proximity_check.rename(label))

    print('Proportion of bookstores near coffee shops: %.3f' % proximity_check.mean())
    
    return(df_bookstore)

#### Re-creation of San Francisco Dataset

In [19]:
df_SanFrancisco = dataset_by_city('San Francisco, CA', creds) # Persistent data. May require a re-run.


Query on `bookstore` yields 131 result(s) found
131 result(s) collected

Query on `coffee shop` yields 183 result(s) found
183 result(s) collected

Proportion of bookstores near the other bookstores: 0.687
Proportion of bookstores near coffee shops: 0.710


We're able to reproduce the same numbers for San Francisco, so the `dataset_by_city` function is working fine. We'll proceed to run the same queries for New York.

In [24]:
df_NewYork = dataset_by_city('New York, NY', creds) # Persistent data. May require a re-run.


Query on `bookstore` yields 203 result(s) found
203 result(s) collected

Query on `coffee shop` yields 162 result(s) found
162 result(s) collected

Proportion of bookstores near the other bookstores: 0.749
Proportion of bookstores near coffee shops: 0.655


The proportions collected for New York seem to be quite close to those collected for San Francisco. In the next segment, statistical tests are conducted to tell us whether the differences in these proportion values are significant.

In [25]:
# Storing the datasets as local files. 

df_SanFrancisco.to_csv('df_SanFrancisco.csv')
df_NewYork.to_csv('df_NewYork.csv')

<a id='item5'></a>

## COMPARISON OF BOOKSTORES IN SAN FRANCISCO AND NEW YORK

The two-sample proportion Z-test is used to compare the proportions of the bookstore data in the two cities. The following assumptions allow the statistical test to be carried out meaningfully:

1. The samples are independent.
2. Each sample includes at least 5 successes (i.e. 'True') and 5 failures (i.e. 'False').
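
The second assumption is easy to verify from the counts alone. The sketch below checks it for the "near other bookstores" counts of the two cities; `meets_sample_size_rule` is a hypothetical helper, and the counts are the ones printed by the test cell that follows.

```python
def meets_sample_size_rule(count, nobs, minimum=5):
    """True when a sample has at least `minimum` successes and `minimum` failures."""
    successes = count
    failures = nobs - count
    return successes >= minimum and failures >= minimum


# (number of True values, number of observations) per city
samples = {'San Francisco': (90, 131), 'New York': (152, 203)}
for city_name, (count, nobs) in samples.items():
    print(city_name, meets_sample_size_rule(count, nobs))
```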

In [26]:
# 2-proportion Z-Test
# Compare the proportions of bookstores between SF and NY
# which are located within 250m (1-block) from at least one other bookstore

c_1, c_2 = df_SanFrancisco['Near Same Kind'].sum(), df_NewYork['Near Same Kind'].sum()
n_1, n_2 = df_SanFrancisco['Near Same Kind'].count(), df_NewYork['Near Same Kind'].count()

count = [c_1, c_2] # No. of entries w/"True" values
nobs = [n_1, n_2] # No. of observations

print(count)
print(nobs)

pooled_pr = (c_1 + c_2)/(n_1 + n_2)
std_err = np.sqrt(pooled_pr * (1 - pooled_pr)*(1.0/n_1 + 1.0/n_2)) 

print('\nPooled sample proportion: %.3f' % pooled_pr)
print('Standard error: %.3f' % std_err)

# Test Statistic and P-value
# Level of significance : .05
proportions_ztest(count, nobs, alternative = 'two-sided')

[90, 152]
[131, 203]

Pooled sample proportion: 0.725
Standard error: 0.050


(-1.2332783930033908, 0.21747191637795837)

Based on the two-sample proportion Z-test, using the Foursquare data collected with a coverage radius of 4000m, the proportion of bookstores that are within 250m of another bookstore in San Francisco (*.687*) is not statistically significantly different from that of the bookstores in New York (*.749*), *z = -1.233, p = .217*.

In [27]:
# 2-proportion Z-Test
# Compare the proportions of bookstores between SF and NY
# which are located within 250m (1-block) from at least one coffee shop

c_1, c_2 = df_SanFrancisco['Near Coffee Shops'].sum(), df_NewYork['Near Coffee Shops'].sum()
n_1, n_2 = df_SanFrancisco['Near Coffee Shops'].count(), df_NewYork['Near Coffee Shops'].count()

count = [c_1, c_2] # No. of entries w/"True" values
nobs = [n_1, n_2] # No. of observations

print(count)
print(nobs)

pooled_pr = (c_1 + c_2)/(n_1 + n_2)
std_err = np.sqrt(pooled_pr * (1 - pooled_pr)*(1.0/n_1 + 1.0/n_2)) 

print('\nPooled sample proportion: %.3f' % pooled_pr)
print('Standard error: %.3f' % std_err)

# Test Statistic and P-value
# Level of significance : .05
proportions_ztest(count, nobs, alternative = 'two-sided')


[93, 133]
[131, 203]

Pooled sample proportion: 0.677
Standard error: 0.052


(1.0444432422744574, 0.29628036589411855)

Based on the two-sample proportion Z-test, using the Foursquare data collected with a coverage radius of 4000m, the proportion of bookstores that are within 250m of a coffee shop in San Francisco (*.710*) is not statistically significantly different from that of the bookstores in New York (*.655*), *z = 1.044, p = .296*.
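
As a cross-check on the two results, the sketch below recomputes the pooled test statistic and the two-sided p-value using only the standard library. `two_proportion_ztest` is a hypothetical function written for this illustration, not a replacement for `proportions_ztest`, and it assumes the pooled-variance form of the test.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(c_1, n_1, c_2, n_2):
    """Pooled two-sample proportion z statistic and two-sided p-value."""
    p_1, p_2 = c_1 / n_1, c_2 / n_2
    pooled = (c_1 + c_2) / (n_1 + n_2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# San Francisco vs New York, using the counts printed in the cells above
z_same, p_same = two_proportion_ztest(90, 131, 152, 203)      # near other bookstores
z_coffee, p_coffee = two_proportion_ztest(93, 131, 133, 203)  # near coffee shops
print(round(z_same, 3), round(p_same, 3))
print(round(z_coffee, 3), round(p_coffee, 3))
```

Both p-values sit well above the .05 level of significance used in this project.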

<a id='item6'></a>

## CONCLUSION

This analysis has provided a brief look at the venue profile of bookstores in San Francisco. While the data collected are sampled proportions, I would argue that we may use the proportion values as estimates of the probability of finding another bookstore or a coffee shop nearby. For business owners, I think the proportion values are useful indicators of
1. the level of competition from neighboring stores offering the same service.
2. the level of complementary services around a target location.

If we had profit/loss data and foot traffic data to go with the data collected, we could investigate further what level of competition is beneficial to the service provided, and what kinds of complementary services are good to have around.

In this analysis, I've only provided the proportion of bookstores near other bookstores or coffee shops. It is clear that the analysis need not be just bookstores and coffee shops, but can be expanded to include other choices of store type.

Comparing San Francisco and New York, the statistical tests show no statistically significant difference between the two cities in either proportion (*p = .217* and *p = .296*), so the venue profiles of bookstores in these two cities appear broadly similar. It should be noted that this analysis only holds under the conditions of a 4000m coverage radius and a proximity distance of 250m. It is also subject to the limitations of the Foursquare data.

While this project concludes here, users are encouraged to vary the parameters used and to extend the comparison to other cities or localities. Doing so may present many opportunities to discover interesting new information and insights.



**Thank you for checking out this project. I hope you have found it enjoyable.**  
*-This project has been done by We You Toh.-*