# What a Neighborhood Needs

## Introduction

### Background
There is a natural flow of customers between businesses. One might go from a hair salon to a nail salon. One might stop for drinks after dancing. Perhaps after clothes shopping people like to grab coffee. People tend to go from one category of business to another. Ideally, near the one you will find the other. Opportunity lies where this isn't the case. For example, if we can determine that people like to stop for bubble tea after shoe shopping, and we can find a neighborhood with shoe shops but without bubble tea, that might be just the right place to invest in a new bubble tea shop.

### Problem
There are a few steps to this analysis. First we create a mapping of customer flow from one category of business to another. Then we find geographic clusters of businesses to call our neighborhoods. We can then analyze the categories and popularity of the businesses in each neighborhood to find out which categories of business we would expect to be popular next stops. We then compare these expected hot categories to the categories that already exist in the neighborhood to find deficiencies.
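
The steps above can be sketched on toy data. This is only an illustration under simplifying assumptions: the venue records, the function names `build_flow` and `missing_next_stops`, and the category lists are all hypothetical, and the sketch ignores both the popularity weighting and the geographic clustering described above.

```python
from collections import Counter, defaultdict

# Toy venues (made up): a primary category plus the categories of common next stops.
toy_venues = [
    {"category": "Shoe Store", "next": ["Bubble Tea Shop", "Coffee Shop"]},
    {"category": "Shoe Store", "next": ["Bubble Tea Shop"]},
    {"category": "Hair Salon", "next": ["Nail Salon"]},
]

def build_flow(venues):
    """Count how often each category is followed by each next-stop category."""
    flow = defaultdict(Counter)
    for venue in venues:
        flow[venue["category"]].update(venue["next"])
    return flow

def missing_next_stops(flow, neighborhood_categories):
    """Expected next-stop categories that the neighborhood does not have,
    ordered from most to least expected."""
    present = set(neighborhood_categories)
    expected = Counter()
    for category in neighborhood_categories:
        expected.update(flow.get(category, {}))
    return [cat for cat, _ in expected.most_common() if cat not in present]

flow = build_flow(toy_venues)
# A hypothetical neighborhood with shoe stores and coffee but no bubble tea
gaps = missing_next_stops(flow, ["Shoe Store", "Coffee Shop"])
print(gaps)
```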

### Interest
The results of this analysis would be of interest to anyone who is considering starting a business or investing in a new one and wants to know where existing customers will naturally feed into it.

## Data
### Data Source
For this project we restrict our area of interest to the city of Seattle, WA, my hometown. We use the Foursquare Places API (https://developer.foursquare.com/docs/api) to get information on various businesses. Foursquare has a remarkably complete dataset of businesses and other locations. Specifically, we use the Places API `search` call to get a list of businesses and locations, called `venues` by the API. As each call returns at most 50 venues, we repeat the call on a tight grid across the city, aggregating the results and removing duplicates. The key pieces of information this gets us are the unique venue identifier, the primary category associated with the venue (e.g. "German Restaurant"), and the exact location. For each venue we then get the number of likes with the `likes` call and the categories of the next venues with the `nextvenues` call. The `nextvenues` call returns the 5 venues that are most commonly checked into immediately after checking into a given venue, as long as 2 hours haven't passed between the check-ins.




In [None]:
import math

import pandas as pd
import numpy as np
import requests
!pip install foursquare
import foursquare
from project_lib import Project
import threading
import json
import pickle

To collect the full dataset, this notebook needs to be run once with `GET_FRESH_VENUES` and `CREATE_INITAL_DATAFRAME` set to `True`, and then approximately 10 times with both set to `False`, with the runs spaced more than an hour apart. This is to work around the Foursquare hourly quota of 5000 regular calls. Getting the full dataset requires approximately 50000 total calls.

This Notebook uses IBM Watson project object storage to store files.  If someone wanted to run this in a different environment than IBM Watson Studio, then the file storage would have to be adapted to the new environment. Simply search for the use of `project` and replace those lines as makes sense. 


In [None]:
GET_FRESH_VENUES = False
CREATE_INITAL_DATAFRAME = False

We now prepare ourselves to use the Foursquare and Watson Project APIs.

The below hidden cell looks like  
`CLIENT_ID = <My Foursquare Client ID>`  
`CLIENT_SECRET = <My Foursquare Secret>`  
with `<My Foursquare Client ID>` and `<My Foursquare Secret>` replaced with the corresponding quoted values from my Foursquare developer account. 

In [None]:
# The code was removed by Watson Studio for sharing.

In [None]:
VERSION = 20180605
client = foursquare.Foursquare(client_id=CLIENT_ID, client_secret=CLIENT_SECRET, version=VERSION)

The hidden cell below looks like  
`PROJECT_ID=<My Project ID>`  
`PROJECT_ACCESS_TOKEN=<My Project Access Token>`  
with `<My Project ID>` and `<My Project Access Token>` replaced with my Watson Studio project's ID and its access token respectively. 

In [None]:
# The code was removed by Watson Studio for sharing.

In [None]:
project = Project(project_id=PROJECT_ID, project_access_token=PROJECT_ACCESS_TOKEN)

### Getting the Venues

#### Preliminary calculations
We call Foursquare `search` on a grid to gather all venues in Seattle. To do this, we first work out the grid of latitudes and longitudes we will search on. We get the extrema of Seattle manually from Google Maps.

In [None]:

seattle_north = 47.734401
seattle_west = -122.437307
seattle_south = 47.494613
seattle_east = -122.245315

We want to do Foursquare searches of approximately 210 meters radius, with the expectation that it is unlikely that more than 50 venues will be in any such tight space. This is a guess, but we will double check as we do our searches that we don't max any out. We want a search grid that completely covers the area. If we use a square grid, how great a spacing will that allow? The search circles just cover the area when half the diagonal of a grid square equals the radius, which gives a spacing of 2r/√2. Let's reduce it by a bit just to be extra certain of no gaps.

In [None]:
radius = 210

spacing = 2*radius/(2**(1/2)) /1.05
spacing
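
As a quick sanity check on the spacing formula, the sketch below verifies that the point farthest from any grid node (the center of a grid square) is still inside a search circle. The helper names are hypothetical and not part of the notebook.

```python
import math

def max_square_grid_spacing(radius_m):
    """Largest square-grid spacing at which circles of radius_m centered on
    the grid nodes leave no gaps: half the cell diagonal equals the radius."""
    return 2 * radius_m / math.sqrt(2)

def farthest_gap(spacing_m):
    """Distance from the center of a grid cell to its nearest node (a corner)."""
    return math.hypot(spacing_m / 2, spacing_m / 2)

r = 210
s_max = max_square_grid_spacing(r)   # spacing at which coverage is exact
s_used = s_max / 1.05                # the notebook's 5% safety margin

print(farthest_gap(s_max) <= r + 1e-9, farthest_gap(s_used) < r)
```

With the 5% margin, the cell center sits strictly inside the search radius, so small rounding in the grid does not open gaps.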



How many searches will that require? First we calculate the length in meters of a degree of latitude and of a degree of longitude around Seattle, using
40000000 as the approximate circumference of the earth in meters. At one time, this was true by the definition of the meter. A degree of longitude shrinks with the cosine of the latitude.

In [None]:
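# A degree of latitude is the same length everywhere; a degree of longitude
# shrinks with the cosine of the latitude.
degree_lat = 40000000/360
degree_long = degree_lat * math.cos(math.pi*seattle_north/180)

seattle_h = (seattle_north - seattle_south)*degree_lat
seattle_w = (seattle_east - seattle_west)* degree_long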

grid_h = int(seattle_h/spacing)
grid_w = int(seattle_w/spacing)
grid_h * grid_w

We have a limit of 5000 API calls per hour, so it's important that the grid height multiplied by the grid width is below that.
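
One detail worth keeping in mind: `np.linspace` with `n` points produces `n - 1` intervals, so truncating with `int()` can leave the actual spacing slightly larger than the target. The sketch below is a more conservative variant, not the notebook's method: it rounds up and adds one point so the intervals never exceed the target spacing. The function names are hypothetical.

```python
import math

EARTH_CIRCUMFERENCE_M = 40_000_000  # same approximation as in the notebook

def meters_per_degree(lat_deg):
    """Approximate meters per degree of latitude and of longitude at lat_deg."""
    per_deg_lat = EARTH_CIRCUMFERENCE_M / 360
    per_deg_lng = per_deg_lat * math.cos(math.radians(lat_deg))
    return per_deg_lat, per_deg_lng

def grid_points(south, north, west, east, spacing_m):
    """Number of grid points per axis so that no interval exceeds spacing_m."""
    per_lat, per_lng = meters_per_degree(north)
    n_lat = math.ceil((north - south) * per_lat / spacing_m) + 1
    n_lng = math.ceil((east - west) * per_lng / spacing_m) + 1
    return n_lat, n_lng

# Seattle bounding box from above, with a target spacing in meters
n_lat, n_lng = grid_points(47.494613, 47.734401, -122.437307, -122.245315, 282.84)
print(n_lat, n_lng, n_lat * n_lng)
```

The product of the two counts is the number of `search` calls, so it is still the figure to compare against the hourly quota.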

Now we call `search` at each node in the grid, multithreading the process to speed things up. We then save the results to permanent storage.

In [None]:

VENUES_PATH = "venues.json"

if GET_FRESH_VENUES:
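    def get_venues(lat, lng, venues_dict):
        # Search around one grid node, using the radius the grid spacing was derived from
        result = client.venues.search(params = {'ll':"%f,%f"%(lat, lng), 'radius':radius})
        venues_list = result["venues"]
        # A full page of 50 results means this node may have been truncated
        if len(venues_list) >= 50:
            # Report this call's own coordinates, not the loop's globals
            print("Found %d venues at %f, %f"%(len(venues_list), lat, lng) )
        for venue in venues_list:
            venues_dict[venue["id"]] = venue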
    venues_dict = {} #We use a dictionary to remove duplicate entries
    threads = []
    for search_lat in np.linspace(seattle_south, seattle_north, grid_h):
        print("Scanning Latitude %f" % search_lat)
        for search_lng in np.linspace(seattle_west, seattle_east, grid_w):
            thread = threading.Thread(target=get_venues, args=(search_lat, search_lng, venues_dict))
            thread.start()
            threads.append(thread)

    for thread in threads:
        thread.join()
        
    venues_json = json.dumps(venues_dict)
    project.save_data(file_name = VENUES_PATH,data = venues_json,set_project_asset=True,overwrite=True)
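
The cell above starts one thread per grid node, which means several thousand threads at once. A possible alternative, sketched here under assumptions, is a bounded pool from `concurrent.futures`. `fake_search` is a hypothetical stand-in for `client.venues.search` so the example runs without credentials, and `collect_venues` and `max_workers=8` are illustrative choices, not part of the notebook.

```python
from concurrent.futures import ThreadPoolExecutor
import itertools

def fake_search(lat, lng):
    """Hypothetical stand-in for the Foursquare search; returns venue dicts."""
    # Every node reports one shared venue to demonstrate deduplication.
    return [{"id": "v%d" % round(lat), "name": "demo"},
            {"id": "shared", "name": "seen from every node"}]

def collect_venues(lats, lngs, search=fake_search, max_workers=8):
    """Run search on every (lat, lng) node with a bounded number of threads."""
    points = list(itertools.product(lats, lngs))
    venues = {}  # keyed by venue id, so duplicates collapse
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Results are merged in the main thread, so the dict needs no lock
        for found in pool.map(lambda p: search(*p), points):
            for venue in found:
                venues[venue["id"]] = venue
    return venues

venues = collect_venues([1.0, 2.0], [10.0, 20.0])
print(sorted(venues))
```

Capping the worker count keeps the number of simultaneous requests predictable while still overlapping network waits.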


Load the venues from permanent storage if we didn't just regenerate them.

In [None]:
if not GET_FRESH_VENUES:
    venues_file = project.get_file(VENUES_PATH)
    venues_dict = json.load(venues_file)
    venues_file.close()

### Creating our Dataframe
We extract the parameters of interest from our raw results and place them in a dataframe. We put in dummy values for "likes" and "next categories" to be filled in later. We also filter out any venues that have no category or are not in Seattle.

In [None]:
DF_PATH = "df.pickle"

def primary_category(venue):
    for cat in venue['categories']:
        if cat['primary']:
            return cat
    return None

if CREATE_INITAL_DATAFRAME:
    columns = ['id', 'name', 'category id', 'category name', 'lat', 'lng', 'likes', 'next categories']
    data = []
    for venue in venues_dict.values():

        cat = primary_category(venue)
        # Strip out anything without a category or that isn't in Seattle
        if (cat is not None and 'city' in venue['location'] 
                and venue['location']['city'] == 'Seattle'):
            data.append((venue['id'], venue['name'], cat['id'], 
                 cat['name'], venue['location']['lat'], 
                 venue['location']['lng'], np.nan, np.nan))

    df = pd.DataFrame(data=data, columns = columns).set_index('id')
    df['next categories'] = df['next categories'].astype(object)
else:
    df_file = project.get_file(DF_PATH)
    df = pickle.load(df_file)
    df_file.close()



### Likes and Next Categories

Here we fill in the "likes" and "next categories" values in our dataframe. This can't be done all in one go because we have an hourly quota of Foursquare calls. As mentioned before, we can only make 5000 Foursquare calls an hour, and we need to make approximately 40000. So this notebook does what it can and saves partial results. The notebook is rerun with `GET_FRESH_VENUES` and `CREATE_INITAL_DATAFRAME` set to `False`, with the runs spaced more than an hour apart, until it is complete. I do this with a project job scheduled to run repeatedly.

In [None]:
for counter, venue_id in enumerate(df.index):
    if counter%100 == 0:
        print("At entry %d and rate remaining is %s" % (counter, client.rate_remaining))
    if np.isnan(df.at[venue_id, 'likes']):
        df.at[venue_id, 'likes'] = client.venues.likes(venue_id)['likes']['count']
    # np.isnan raises on a list, so test the type instead: anything that is not yet a list is still unfilled.
    if not isinstance(df.at[venue_id, 'next categories'], list):
        next_cats = []
        result = client.venues.nextvenues(venue_id)
        for venue in result['nextVenues']['items']:
            primary = primary_category(venue)
            if primary is not None:
                next_cats.append({'id': primary['id'],'name':primary['name']})
        df.at[venue_id, 'next categories'] = next_cats
    # Reserve our last 100 calls.  It's good never to be completely out
    if client.rate_remaining is not None and int(client.rate_remaining) < 100:
        break    
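
The resume-across-runs pattern can be sketched without the API. Everything here is hypothetical: `fill_missing`, `is_complete`, the fake fetch functions, and the tiny budget are stand-ins for the Foursquare calls and the hourly quota. The point it illustrates is that "not yet fetched" needs a marker distinct from a legitimately empty result, which is why `None` is used here and why the loop above tests for a list.

```python
def fill_missing(records, fetch_likes, fetch_next, budget, reserve=2):
    """Fill unfetched fields, and stop starting new venues once the remaining
    call budget is at or below the reserve. Returns the calls left."""
    for rec in records.values():
        if budget <= reserve:
            break
        if rec["likes"] is None:
            rec["likes"] = fetch_likes(rec["id"])
            budget -= 1
        if rec["next"] is None:
            rec["next"] = fetch_next(rec["id"])
            budget -= 1
    return budget

def is_complete(records):
    """True once every record has both fields fetched (empty lists count)."""
    return all(r["likes"] is not None and r["next"] is not None
               for r in records.values())

records = {i: {"id": i, "likes": None, "next": None} for i in range(5)}
fake_likes = lambda venue_id: 10 * venue_id   # pretend API call
fake_next = lambda venue_id: []               # an empty result is still a result

passes = 0
while not is_complete(records):
    fill_missing(records, fake_likes, fake_next, budget=6)  # one "hour" of quota
    passes += 1
print(passes)
```

Each pass skips what earlier passes already filled, so rerunning is safe and only spends quota on the remaining venues.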

In [None]:
if df['next categories'].hasnans:
    print("Not complete; will need to be rerun after an hour.")
else:
    print("All data gathered.")

Save to permanent storage. 

In [None]:
project.save_data(file_name = DF_PATH,data = pickle.dumps(df),set_project_asset=True,overwrite=True)


After all data is gathered, the analysis notebook can be run to analyze the data. 