In [1]:
import numpy as np
import pandas as pd
import requests
import zipfile
import time
import math
import folium
from folium import plugins
import os

# Introduction / Business Problem

My partner is a jewellery designer and maker with a successful shop in one of the trendier parts of London. She has expressed an interest in opening a second shop, and her usual approach would be to do some market research. Rather than simply asking the coolest people we know, I'd like to use data science to identify an appropriate location.

The basic question, then, is this: can we work out which areas of London are the hippest, and which are on the verge of becoming hip? And given that, where would be a good place to open a cool jewellery shop?

This approach could apply more broadly to anyone interested in opening an independent store or brand in London and, if it is successful, to any city for which we have the appropriate data.


# Data

We'll break this down into 2 components: quantifying trendiness into clusters, and finding competitive advantage.

To quantify trendiness we'll explore different options, among them identifying whether there are a large number of chain (non-independent) stores in the area, or stores with particular or unusual categories.  We'll look at the frequency and density of specific category types such as record stores and coffee shops.

To find competitive advantage, we'll look at details of the specific jewellery shops, including rating and price. Consideration should be given to the total number of shops and restaurants in the area (which would generate more footfall and be a positive influence), as well as to other jewellery shops (which would mean more competition and be a negative influence).

The data we'll need includes:

* **Geocoding** - to identify the specific locations
* **Foursquare API** - to identify the qualifying places (jewelry stores, restaurants etc.) and essential details around them


## Geocoding

[Geonames.org](https://www.geonames.org/) is a fantastic free resource that publishes geocoded place names and their latitudes/longitudes for many countries around the world. The GeoNames geographical database is available for download free of charge under a Creative Commons Attribution license. The resource includes an explicit accuracy value for each latitude/longitude, which will be useful for filtering out bad, inaccurate or duplicate data.

In [2]:
url = 'http://download.geonames.org/export/zip/GB.zip'
zipname = 'GB.zip'
r = requests.get(url)

with open(zipname, 'wb') as f:
    f.write(r.content)

with zipfile.ZipFile(zipname, 'r') as f:
    f.extractall()

columns = ['Countrycode','Postalcode','Placename','Adminname1','Admincode1','Adminname2','Admincode2','Adminname3','Admincode3','Latitude','Longitude','Accuracy']
gb = pd.read_csv('GB.txt', sep='\t', header=None, names=columns)
gb.head()

Unnamed: 0,Countrycode,Postalcode,Placename,Adminname1,Admincode1,Adminname2,Admincode2,Adminname3,Admincode3,Latitude,Longitude,Accuracy
0,GB,DN14,Goole,England,ENG,East Riding of Yorkshire,11609011,,,53.7167,-0.8667,4.0
1,GB,DN14,Pollington,England,ENG,East Riding of Yorkshire,11609011,,,53.6709,-1.0724,4.0
2,GB,DN14,Faxfleet,England,ENG,East Riding of Yorkshire,11609011,,,53.7067,-0.6939,4.0
3,GB,DN14,Laxton,England,ENG,East Riding of Yorkshire,11609011,,,53.7167,-0.8,4.0
4,GB,DN14,Old Goole,England,ENG,East Riding of Yorkshire,11609011,,,53.7125,-0.8909,3.0


**We want to limit ourselves to the area of Greater London.**

Also, 'Accuracy' is defined as below:

```
Accuracy is an integer, the higher the better :
1 : estimated as average from numerically neigbouring postal codes
3 : same postal code, other name
4 : place name from geonames db
6 : postal code area centroid
```

Consequently, we'll limit ourselves to codes with an Accuracy greater than or equal to 3 (averages of neighbouring postal codes aren't useful). Because we're going to be using the Latitude/Longitude, we'll remove duplicate coordinate pairs and keep only the first place name for each. Postalcode is tricky to use as it spans several Placenames, but we'll keep it around for reference and collapse it into a single comma-separated column.

In [3]:
ldn = gb[(gb['Adminname2']=='Greater London') & (gb['Accuracy'] >= 3.0)].copy()
ldn.drop(['Countrycode', 'Adminname1', 'Admincode1', 'Adminname2', 'Admincode2', 'Adminname3', 'Admincode3', 'Accuracy'], axis='columns', inplace=True)
ldn.drop_duplicates(subset=['Latitude', 'Longitude', 'Postalcode'], inplace=True)
ldn = ldn.groupby(['Latitude','Longitude']).agg({'Placename':'first', 'Postalcode': ','.join }).reset_index()
print('There are {} Places we will evaluate in London'.format(len(ldn.index)))

There are 468 Places we will evaluate in London
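
To make the collapse step concrete, here is a minimal sketch on a made-up three-row frame. The place names, postal codes and coordinates are illustrative assumptions, not rows taken from `GB.txt`. It shows how rows that share a latitude/longitude pair are merged into a single row that keeps the first place name and joins the postal codes with a comma.

```python
import pandas as pd

# Toy data (hypothetical values): the first two rows share the same coordinates.
toy = pd.DataFrame({
    'Placename':  ['Shoreditch', 'Hoxton', 'Camden Town'],
    'Postalcode': ['E1', 'N1', 'NW1'],
    'Latitude':   [51.52, 51.52, 51.54],
    'Longitude':  [-0.08, -0.08, -0.14],
})

# One row per coordinate pair: first place name, comma-joined postal codes.
collapsed = (
    toy.groupby(['Latitude', 'Longitude'])
       .agg({'Placename': 'first', 'Postalcode': ','.join})
       .reset_index()
)
print(collapsed)
```

The second place name at a shared coordinate is dropped, which is acceptable here because the name is only used as a label for the search location.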


Since we're going to use all these locations, let's figure out the average of them to build the centroid of our map.

In [4]:
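# Average only the coordinate columns; the frame also holds text columns
latitude, longitude = ldn[['Latitude', 'Longitude']].mean()
print('The geographical coordinates of London are {}, {}.'.format(latitude, longitude))

The geographical coordinates of London are 51.50685085470086, -0.12775576923076926.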


In [5]:
# create map of London using latitude and longitude values
map_ldn = folium.Map(location=[latitude, longitude], zoom_start=11)

# add markers to map
for lat, lng, place in zip(ldn['Latitude'], ldn['Longitude'], ldn['Placename']):
    label = folium.Popup(place, parse_html=True)
    folium.Circle(
        [lat, lng],
        1000,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_ldn)
   
map_ldn.save('map_ldn.html')
map_ldn

As we're primarily interested in the Central London area, this should be enough. Note that the map outlines a radius of 1000 m (about 0.6 miles) around each location centre, so there will be a lot of overlap between areas as we search.
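
To get a feel for how much neighbouring search areas overlap, a rough sketch is to compute the great-circle distance between two centres and compare it with twice the 1000 m radius passed to `folium.Circle`. The `haversine_m` and `circles_overlap` helpers and the two sample coordinates are illustrative assumptions, not part of the original notebook.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in metres (approximation)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def circles_overlap(lat1, lng1, lat2, lng2, radius_m):
    """Two equal circles overlap when their centres are closer than 2 * radius."""
    return haversine_m(lat1, lng1, lat2, lng2) < 2 * radius_m

# Hypothetical pair of centres 0.01 degrees of latitude apart in central London.
print(haversine_m(51.51, -0.13, 51.52, -0.13))
print(circles_overlap(51.51, -0.13, 51.52, -0.13, radius_m=1000))
```

Applied to every pair of rows in `ldn`, the same check would give a count of overlapping search areas, which is why venues need to be de-duplicated later on.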

## Foursquare API

We will retrieve both the places and the necessary details from the Foursquare API. We'll get the precise location and category of each shop, as well as details such as price and rating where necessary. Some of these are premium calls, so we'll limit them to stay within the boundaries of free usage.

When using the search and explore endpoints the Foursquare API will naturally constrain the responses, presumably to keep people from scraping/farming locations.  We'll need to keep this in mind, as it means that our results will not be "complete" for a given radius.

In [6]:
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20200101' # Foursquare API version

We'll begin by retrieving the full list of categories from Foursquare and identifying which category corresponds to a jewelry store (note that in the UK this is spelled "jewellery").

In [7]:
url = 'https://api.foursquare.com/v2/venues/categories?client_id={}&client_secret={}&v={}'.format(
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION, 
                )
#categories = pd.DataFrame(requests.get(url).json()['response']['categories'])
cat_response = requests.get(url).json()['response']

def get_subcats(catlist, catref):
    # Walk the category tree depth first and collect only the leaf categories
    if len(catref['categories']) > 0:
        for cat in catref['categories']:
            get_subcats(catlist, cat)
    else:
        catlist.append({'id': catref['id'], 'name': catref['name']})
cat_list = []
get_subcats(cat_list, cat_response)

cats = pd.DataFrame(cat_list)

cats[cats['name'].str.contains('Jewel')]

Unnamed: 0,id,name
742,4bf58dd8d48988d111951735,Jewelry Store
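
The recursive walk can be checked on a small hand-made tree. The nested dictionary below is a made-up stand-in that mimics the `categories` nesting of the response (its ids and structure are invented), and `leaf_categories` is an illustrative variant of `get_subcats` that returns a list instead of appending to one.

```python
def leaf_categories(node):
    """Return [{'id', 'name'}] for every leaf under `node`, depth first."""
    children = node.get('categories', [])
    if not children:
        return [{'id': node['id'], 'name': node['name']}]
    leaves = []
    for child in children:
        leaves.extend(leaf_categories(child))
    return leaves

# Invented sample tree; real ids come from the API response.
tree = {'categories': [
    {'id': 'a', 'name': 'Shop & Service', 'categories': [
        {'id': 'a1', 'name': 'Jewelry Store', 'categories': []},
        {'id': 'a2', 'name': 'Record Shop', 'categories': []},
    ]},
    {'id': 'b', 'name': 'Food', 'categories': []},
]}

leaf_names = [c['name'] for c in leaf_categories(tree)]
print(leaf_names)
```

Like the original function, this keeps only leaf categories, so a parent category that has children of its own does not appear in the resulting list.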


We want to focus on specific types of venues, and the Foursquare API allows us to break things down into category "sections": shops, food, drinks, coffee, arts. We'll make separate calls for each of these venue sections and collect the relevant details. Although this may not get us everything, narrowing the search by section returns significantly more venues than a single unfiltered search.

Then we'll grab all the Jewelry Shop category venues in London exclusively.

We'll start by building a re-usable function that we can call across all of these, that iterates and is reasonably fault-tolerant.
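
One way to think about the fault tolerance is as a bounded retry loop around a single request. The sketch below is a generic illustration only: `with_retries` and `flaky` are hypothetical names, and the stand-in function simulates a call that fails twice before succeeding rather than contacting Foursquare.

```python
import time

def with_retries(fetch, max_retries=5, delay_s=2.0):
    """Call `fetch()` until it succeeds or `max_retries` retries are used up.

    Returns the result, or None if every attempt raised an exception.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception as err:
            if attempt == max_retries:
                print("Giving up after {} retries: {}".format(max_retries, err))
                return None
            print("Attempt {} failed, retrying in {}s".format(attempt + 1, delay_s))
            time.sleep(delay_s)

# Stand-in for an API call: fails twice, then returns a payload.
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('temporary failure')
    return {'meta': {'code': 200}}

result = with_retries(flaky, delay_s=0)
print(result)
```

The important detail is that the attempt counter lives outside the code that is retried, so a persistent failure eventually ends the loop instead of retrying forever.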

In [8]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 1000 # define radius

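def getNearbyVenues(placenames, latitudes, longitudes, radius=800, categories=(), section=None):
    """Page through venues/explore for each place and return one row per venue found.

    `categories` is a sequence of Foursquare category ids; `section` is an optional
    section name such as 'shops' or 'food'.
    """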
    
    total_calls = 0
    total_results = 0
    total_projected_calls = 0
    venues_list=[]
    
    start = time.time()

    for place, lat, lng in zip(placenames, latitudes, longitudes):

        offset = 0
        max_results = 100
        retry = 0  # failed attempts for this place; kept outside the loop so it is not reset on every pass

        while(offset < max_results):

            url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}&offset={}'.format(
                CLIENT_ID, 
                CLIENT_SECRET, 
                VERSION, 
                lat, 
                lng, 
                radius, 
                LIMIT,
                offset)
            
            if len(categories) > 0:
                url = url + '&categoryId=' + ','.join(categories)
            if section is not None:
                url = url + '&section=' + section
            
            total_calls = total_calls + 1
            results = requests.get(url).json()
    
            #print(results)
            # ok response code     
            if(results['meta']['code'] == 200):
                # return only relevant information for each nearby venue
                for v in results["response"]['groups'][0]['items']:

                    venues_list.append([
                        place, 
                        lat, 
                        lng,
                        v['venue']['id'], 
                        v['venue']['name'], 
                        v['venue']['location']['lat'], 
                        v['venue']['location']['lng'],  
                        v['venue']['location']['distance'],  
                        '|'.join(list(map(lambda x: x['name'], v['venue']['categories'])))])
                    
                max_results = results["response"]['totalResults']

                offset = offset + LIMIT
                    
            elif(retry < 5):
                retry = retry + 1
                print("Error in fetch record for [{}]: retry {} in 2s".format(place, retry))
                time.sleep(2)

            else:
                print("Unable to fetch record for [{}]".format(place))
                break  # give up on this place after 5 retries
            
        total_results = total_results + max_results
        total_projected_calls = total_projected_calls + math.ceil(max_results/LIMIT)


    # Passing the column names up front also works when no venues were found
    nearby_venues = pd.DataFrame(venues_list, columns=[
                'Placename',
                'Latitude', 
                'Longitude',
                'Venue ID',
                'Venue',
                'Venue Latitude',
                'Venue Longitude',
                'Distance',
                'Venue Categories'])

    end = time.time()
    #print('====================')
    #print('Radius        \t{}'.format(radius))
    #print('Calls to API: \t{}'.format(total_calls))
    #print('Calls Proj.:  \t{}'.format(total_projected_calls))
    #print('Places:       \t{}'.format(placenames.size))
    #print('Results:      \t{}'.format(total_results))
    #print('Time:         \t{}s'.format(end - start))
    #print('====================')

    return(nearby_venues)
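
The paging inside `getNearbyVenues` comes down to simple arithmetic: the number of requests for one location is the `totalResults` count divided by the page size, rounded up. As a sketch of that arithmetic only (the `page_offsets` helper is hypothetical and makes no API call):

```python
import math

def page_offsets(total_results, limit=100):
    """Return the offset to request for each page of `total_results` items."""
    n_pages = math.ceil(total_results / limit)
    return [page * limit for page in range(n_pages)]

print(page_offsets(230))  # three pages: offsets 0, 100 and 200
```

In the function above the total is only known after the first response, which is why `max_results` starts at a placeholder value and is overwritten once a page has been fetched.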

Now let's do a broad scan of venues by location, then specifically Jewelry Stores by location.

In [9]:

#print("get list of shops:")
shops = getNearbyVenues(placenames=ldn['Placename'],
                                latitudes=ldn['Latitude'],
                                longitudes=ldn['Longitude'],
                                radius=radius,
                                section='shops'
                        )

#print("get list of coffee:")
coffee = getNearbyVenues(placenames=ldn['Placename'],
                                latitudes=ldn['Latitude'],
                                longitudes=ldn['Longitude'],
                                radius=radius,
                                section='coffee'
                        )

#print("get list of food:")
food = getNearbyVenues(placenames=ldn['Placename'],
                                latitudes=ldn['Latitude'],
                                longitudes=ldn['Longitude'],
                                radius=radius,
                                section='food'
                        )

#print("get list of drinks:")
drinks = getNearbyVenues(placenames=ldn['Placename'],
                                latitudes=ldn['Latitude'],
                                longitudes=ldn['Longitude'],
                                radius=radius,
                                section='drinks'
                        )

#print("get list of arts:")
arts = getNearbyVenues(placenames=ldn['Placename'],
                                latitudes=ldn['Latitude'],
                                longitudes=ldn['Longitude'],
                                radius=radius,
                                section='arts'
                        )

#print("get list of jewelry shops")
jewels = getNearbyVenues(placenames=ldn['Placename'],
                                latitudes=ldn['Latitude'],
                                longitudes=ldn['Longitude'],
                                radius=radius,
                                categories=['4bf58dd8d48988d111951735']
                        )
                         
                        

get list of shops:
Radius        	1000
Calls to API: 	525
Calls Proj.:  	525
Places:       	468
Results:      	20802
Time:         	283.33800292015076s
get list of coffee:
Radius        	1000
Calls to API: 	516
Calls Proj.:  	505
Places:       	468
Results:      	14136
Time:         	223.36933302879333s
get list of food:
Error in fetch record for [London]: retry 1 in 2s
Radius        	1000
Calls to API: 	600
Calls Proj.:  	594
Places:       	468
Results:      	26765
Time:         	333.6433038711548s
get list of drinks:
Radius        	1000
Calls to API: 	526
Calls Proj.:  	521
Places:       	468
Results:      	14743
Time:         	237.45759797096252s
get list of arts:
Error in fetch record for [London]: retry 1 in 2s
Error in fetch record for [Farringdon]: retry 1 in 2s
Error in fetch record for [Ladbroke Grove]: retry 1 in 2s
Error in fetch record for [Harlesden]: retry 1 in 2s
Error in fetch record for [West Hampstead]: retry 1 in 2s
Error in fetch record for [Gospel Oak]: retry 1 in 

Because the search areas overlap, the same venue can be returned for several locations. Let's build unique lists for each section and, for each venue, keep the row from the search location that is nearest to it.

In [10]:
shops_unique = shops.sort_values(by=['Venue ID','Distance']).drop_duplicates(subset=['Venue ID'], keep='first')

coffee_unique = coffee.sort_values(by=['Venue ID','Distance']).drop_duplicates(subset=['Venue ID'], keep='first')

food_unique = food.sort_values(by=['Venue ID','Distance']).drop_duplicates(subset=['Venue ID'], keep='first')

drinks_unique = drinks.sort_values(by=['Venue ID','Distance']).drop_duplicates(subset=['Venue ID'], keep='first')

arts_unique = arts.sort_values(by=['Venue ID','Distance']).drop_duplicates(subset=['Venue ID'], keep='first')

jewels_unique = jewels.sort_values(by=['Venue ID','Distance']).drop_duplicates(subset=['Venue ID'], keep='first')
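
A tiny made-up example shows what this de-duplication does: the same venue is returned by two overlapping searches, and only the row with the smaller distance survives. The place names, venue ids and distances below are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    'Placename': ['Soho', 'Covent Garden', 'Soho'],
    'Venue ID':  ['v1', 'v1', 'v2'],
    'Distance':  [650, 120, 300],
})

# Sort so the nearest search location comes first within each venue,
# then keep only that first row per venue.
toy_unique = (
    toy.sort_values(by=['Venue ID', 'Distance'])
       .drop_duplicates(subset=['Venue ID'], keep='first')
)
print(toy_unique)
```

Venue `v1` ends up attached to the search location that was 120 m away rather than the one 650 m away.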

As a sample, we'll look at a heatmap of jewellery shops in London.

In [11]:
# create map of London using latitude and longitude values
map_jewels = folium.Map(location=[latitude, longitude], zoom_start=13)

# add markers to map
for lat, lng, venue in zip(jewels_unique['Venue Latitude'], jewels_unique['Venue Longitude'], jewels_unique['Venue']):
    label = folium.Popup(venue, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=1,
        popup=label,
        fill_color='#3db7e4',
        fill_opacity=0.3,
        parse_html=False).add_to(map_jewels)
   

# plot heatmap
map_jewels.add_child(plugins.HeatMap(jewels_unique[['Venue Latitude', 'Venue Longitude']].values.tolist(), radius=20))
map_jewels.save('map_jewels.html')
map_jewels

Now that we have some data, we can do some exploration.