# Location Recommendation System for New Businesses in Budapest, Hungary: A Study
## Introduction
In this project I try to help people who want to start a new business, or expand an existing one, in Budapest, Hungary.

One important element of starting a successful business is finding the best location for it. Similar businesses often gravitate toward each other, forming clusters around the city. If you open your new shop close to your competitors, there is a higher chance that people who are already in the area will find it and try it.

The user of the recommendation system enters a category, e.g. "Coffee Shop", "Restaurant", or "Bookstore", and the system determines the best location for a new venue based on the distribution of the existing ones.

In this notebook I explain in detail how the data is fetched and processed, and how the location recommendation is generated.

## Data
The data is taken from Foursquare. I search for the chosen kind of business in the central part of Budapest and collect the venues' location information. The analysis needs only the geographical coordinates of the venues.

The list of categories that the user can choose from is also taken from Foursquare.

To work around the fact that Foursquare returns at most 50 venues per request, I divided the area of interest into 24 equal-sized rectangular regions and query each region separately.
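
The number 24 follows from the grid itself: 5 latitude edges and 7 longitude edges enclose 4 × 6 = 24 cells. The sketch below enumerates the south-west and north-east corner of every cell in plain Python. The helper name `grid_cells` is made up for this illustration, and the start values and 0.01-degree step are assumed to mirror the grid that is set up later in the notebook.

```python
def grid_cells(lat_start, lng_start, n_lat_edges, n_lng_edges, step=0.01):
    """Return (sw, ne) corner pairs for every cell of a regular lat/lng grid."""
    # Round to 3 decimals to hide floating-point noise such as 47.485000000000004
    lats = [round(lat_start + i * step, 3) for i in range(n_lat_edges)]
    lngs = [round(lng_start + j * step, 3) for j in range(n_lng_edges)]
    cells = []
    for i in range(len(lats) - 1):
        for j in range(len(lngs) - 1):
            sw = (lats[i], lngs[j])          # south-west corner
            ne = (lats[i + 1], lngs[j + 1])  # north-east corner
            cells.append((sw, ne))
    return cells

cells = grid_cells(47.475, 19.02, n_lat_edges=5, n_lng_edges=7)
print(len(cells))   # 24
print(cells[0])     # ((47.475, 19.02), (47.485, 19.03))
```

Each `(sw, ne)` pair is one bounding box, so every region can be queried on its own and stays well below the per-request cap in most cases.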

### Install and import the necessary modules

In [1]:
#!pip install geopy
import requests
import json
import pandas as pd
#from geopy.geocoders import Nominatim
import folium
import numpy as np
from sklearn.cluster import MeanShift
import matplotlib.cm as cm
import matplotlib.colors as colors


### Set some constant values

In [2]:
CLIENT_ID = 'ZBC2EWETE1R2DGRBAWJ1YGOOUBCTXCOZTJS4JHQ0V1MBJPJU' # your Foursquare ID
CLIENT_SECRET = 'UGTY5QSD0CVQQC1DFIKRZULRMNVZJSFAPKVHWKSZXR0ADP5Z' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 500 # requested maximum; Foursquare still returns at most 50 venues per call
BP_LAT = 47.495
BP_LNG = 19.05

### Set up a matrix of coordinates for the region split

In [3]:
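# 5 x 7 grid of corner points: rows share a latitude, columns share a longitude
lat, lng = np.mgrid[47.475:47.516:0.01, 19.02:19.081:0.01]
# the same corner points as a flat array of (lat, lng) pairs
ll = np.column_stack([lat.ravel(), lng.ravel()])
lat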

array([[47.475, 47.475, 47.475, 47.475, 47.475, 47.475, 47.475],
       [47.485, 47.485, 47.485, 47.485, 47.485, 47.485, 47.485],
       [47.495, 47.495, 47.495, 47.495, 47.495, 47.495, 47.495],
       [47.505, 47.505, 47.505, 47.505, 47.505, 47.505, 47.505],
       [47.515, 47.515, 47.515, 47.515, 47.515, 47.515, 47.515]])

In [9]:
lng

array([[19.02, 19.03, 19.04, 19.05, 19.06, 19.07, 19.08],
       [19.02, 19.03, 19.04, 19.05, 19.06, 19.07, 19.08],
       [19.02, 19.03, 19.04, 19.05, 19.06, 19.07, 19.08],
       [19.02, 19.03, 19.04, 19.05, 19.06, 19.07, 19.08],
       [19.02, 19.03, 19.04, 19.05, 19.06, 19.07, 19.08]])

### Here is the visualization of the 24 regions

In [4]:
map = folium.Map(location=[BP_LAT, BP_LNG], zoom_start=13)
    
for lt, ln in zip(lat, lng):
    folium.PolyLine([(lt[0], ln[0]), (lt[-1], ln[-1])]).add_to(map)  # horizontal lines
    
for lt, ln in zip(lat.T, lng.T):
    folium.PolyLine([(lt[0], ln[0]), (lt[-1], ln[-1])]).add_to(map)  # vertical lines
    
map

### Once the regions are specified and we know the bordering coordinates of all of them, we can fetch the necessary information from Foursquare

Before that, however, we need the list of categories, which we also fetch from Foursquare. We make a single API call and fill a pandas DataFrame with the results.

In [5]:
url = 'https://api.foursquare.com/v2/venues/categories?client_id={}&client_secret={}&v={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET,
            VERSION)

results = requests.get(url).json()
major_category_list = results['response']['categories']
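# Collect the rows first; DataFrame.append was removed in pandas 2.0
rows = []
for major_category in major_category_list:
    for minor_category in major_category['categories']:
        rows.append({"id": minor_category['id'], "name": minor_category['name']})

# Keep "id" as the first column, as in the original append-based version
df_categories = pd.DataFrame(rows, columns=["id", "name"])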
        
print("There are {} different categories in the list.".format(df_categories.shape[0]))

There are 459 different categories in the list.
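
The loop above reads two levels of the category tree. If categories nested more deeply should also be selectable, a recursive walk is one option. The sketch below assumes that every node carries `id` and `name` and that children, when present, sit under the same `categories` key as in the code above. The helper `flatten_categories` and the sample tree are made up for illustration and are not real Foursquare data.

```python
def flatten_categories(nodes):
    """Yield (id, name) for every category in a nested category tree."""
    for node in nodes:
        yield node["id"], node["name"]
        # Child categories, if any, are assumed to live under the "categories" key
        yield from flatten_categories(node.get("categories", []))

# Made-up sample tree, for illustration only
sample_tree = [
    {"id": "a", "name": "Shop", "categories": [
        {"id": "b", "name": "Bookstore", "categories": []},
        {"id": "c", "name": "Specialty Shop", "categories": [
            {"id": "d", "name": "Comic Shop"},
        ]},
    ]},
]

flat = list(flatten_categories(sample_tree))
print(flat)
```

The resulting list of `(id, name)` tuples can be passed to `pd.DataFrame(flat, columns=["id", "name"])` to obtain the same two-column layout as `df_categories`.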


Now we generate the 24 URLs for our search and store them in a list. The `categoryId` is looked up in the DataFrame that we just created. You can choose any category from [Foursquare's category list](https://developer.foursquare.com/docs/build-with-foursquare/categories/).

In [6]:
intent = "browse"
category = "Bookstore"

try:
    # Select the id by column name rather than by column position
    categoryId = df_categories.loc[df_categories['name'] == category, 'id'].iloc[0]
except IndexError:
    raise ValueError("Sorry, there is no such category: {}".format(category))
    
url_list = []
for i in range(lat.shape[0] - 1):
    for j in range(lat.shape[1] - 1):
        sw_lat = lat[i][j]
        sw_lng = lng[i][j]
        ne_lat = lat[i+1][j+1]
        ne_lng = lng[i+1][j+1]
        sw = "{:.3f},{:.3f}".format(sw_lat, sw_lng)
        ne = "{:.3f},{:.3f}".format(ne_lat, ne_lng)

        url_list.append('https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&v={}&limit={}&intent={}&sw={}&ne={}&categoryId={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET,
            VERSION,
            LIMIT,
            intent,
            sw,
            ne,
            categoryId))
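
As an alternative to the long format string, the query can be assembled from a dictionary with the standard library, which also takes care of URL-encoding. This is only a sketch: `build_search_url` is a hypothetical helper, the credentials and category id are placeholders, and the parameter names are the ones already used in the cell above.

```python
from urllib.parse import urlencode

def build_search_url(sw, ne, category_id, client_id, client_secret,
                     version="20180605", limit=50, intent="browse"):
    """Build a venues/search URL for one bounding box (sw and ne are (lat, lng) tuples)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "limit": limit,
        "intent": intent,
        "sw": "{:.3f},{:.3f}".format(*sw),
        "ne": "{:.3f},{:.3f}".format(*ne),
        "categoryId": category_id,
    }
    # safe="," keeps the comma in "lat,lng" readable instead of encoding it as %2C
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params, safe=",")

url = build_search_url((47.475, 19.02), (47.485, 19.03),
                       category_id="CATEGORY_ID",
                       client_id="YOUR_ID", client_secret="YOUR_SECRET")
print(url)
```

Keeping the parameters in a dictionary makes it easier to add or change one of them without recounting the `{}` placeholders.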

In this cell we call the Foursquare API with the 24 URLs generated above and store the coordinates of the returned venues in the variable `coordlist`.

In [7]:
coordlist = []

for url in url_list:
    results = requests.get(url).json()
    venues = results['response']['venues']
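    for venue in venues:
        # Use separate names so the grid arrays `lat` and `lng` are not overwritten
        venue_lat = venue['location']['lat']
        venue_lng = venue['location']['lng']
        coordlist.append([venue_lat, venue_lng])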
        
num_results = len(coordlist)
print("We have found {} {}s.".format(num_results, category))

We have found 228 Bookstores.
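
The loop above assumes that every response contains `response.venues` and that every venue has coordinates. A slightly more defensive variant might look like the sketch below, which also reports whether a region reached the 50-venue cap, in which case some venues were probably cut off. The helper `extract_coordinates` and the sample payload are made up for illustration; only the `response` / `venues` / `location` / `lat` / `lng` keys come from the code above.

```python
def extract_coordinates(payload, cap=50):
    """Return the [lat, lng] pairs of one search response and whether the cap was hit."""
    venues = payload.get("response", {}).get("venues", [])
    coords = []
    for venue in venues:
        location = venue.get("location", {})
        # Skip venues without coordinates instead of raising KeyError
        if "lat" in location and "lng" in location:
            coords.append([location["lat"], location["lng"]])
    # If a region returns `cap` venues, some results were probably cut off
    return coords, len(venues) >= cap

# Made-up payload, for illustration only
sample_payload = {"response": {"venues": [
    {"name": "Example A", "location": {"lat": 47.49, "lng": 19.05}},
    {"name": "Example B", "location": {}},
]}}

coords, truncated = extract_coordinates(sample_payload)
print(coords, truncated)
```

A region for which `truncated` is true could be split into smaller boxes and queried again.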


### Let's see their distribution on a map.

In [8]:
for coord in coordlist:
    folium.CircleMarker(
        coord,
        radius=4,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map)
    
map

## Methodology: Clustering with Mean Shift algorithm

We now cluster the results with the Mean Shift algorithm. An important feature of Mean Shift is that you do not have to specify the number of clusters in advance. Instead, you specify the so-called bandwidth, which is the radius of the neighbourhood used when searching for dense areas. The smaller the bandwidth, the more clusters are identified, each with fewer members that lie closer to each other.

We are looking for the clusters in which the venues are closest to each other. Below we set the bandwidth to a relatively small value, which results in numerous clusters, especially if there are many results for the given category. In my experience, a bandwidth of 0.002 (measured in degrees, since we cluster raw latitude/longitude pairs) gives good results for most categories.

Once the clustering is done, we choose the 5 clusters with the most members; these give us the recommended locations for our new venue.
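
To make the role of the bandwidth concrete, here is a toy Mean Shift with a flat kernel in plain Python. It is a simplified sketch and not scikit-learn's implementation, which is more elaborate: every point is repeatedly moved to the mean of all points within one bandwidth, and points that end up at (nearly) the same position form a cluster. The function name and the sample points are made up for illustration.

```python
import math

def mean_shift_flat(points, bandwidth, max_iter=100, tol=1e-9):
    """Toy Mean Shift with a flat kernel: returns (centers, labels)."""
    modes = []
    for x, y in points:
        for _ in range(max_iter):
            # Neighbours are all points within `bandwidth` of the current position
            near = [(px, py) for px, py in points
                    if math.hypot(px - x, py - y) <= bandwidth]
            if not near:
                break
            new_x = sum(px for px, _ in near) / len(near)
            new_y = sum(py for _, py in near) / len(near)
            shift = math.hypot(new_x - x, new_y - y)
            x, y = new_x, new_y  # move to the neighbourhood mean
            if shift < tol:
                break
        modes.append((x, y))

    # Points whose modes ended up within one bandwidth share a cluster
    centers, labels = [], []
    for mx, my in modes:
        for idx, (cx, cy) in enumerate(centers):
            if math.hypot(mx - cx, my - cy) < bandwidth:
                labels.append(idx)
                break
        else:
            centers.append((mx, my))
            labels.append(len(centers) - 1)
    return centers, labels

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
centers, labels = mean_shift_flat(points, bandwidth=1.0)
print(labels)  # [0, 0, 0, 1, 1]
```

With a bandwidth of 1.0 the two groups are far enough apart to stay separate; a much larger bandwidth would pull all five points into a single cluster.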

In [9]:
BANDWIDTH = 0.002
number_of_reccommendations = 5

X = np.array(coordlist)
clust = MeanShift(bandwidth=BANDWIDTH).fit(X)  # recent scikit-learn versions require the keyword
print("{} clusters were identified by Mean Shift algorithm".format(len(clust.cluster_centers_)))


77 clusters were identified by Mean Shift algorithm
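
Because the clustering runs on raw latitude/longitude values, the bandwidth is a distance in degrees, and a degree of longitude is shorter than a degree of latitude at Budapest's latitude. The sketch below gives a rough idea of what 0.002 degrees means on the ground. It assumes a spherical Earth with a mean radius of 6,371 km, and the helper name is made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation

def degrees_to_meters(delta_deg, latitude_deg):
    """Approximate north-south and east-west length of `delta_deg` degrees."""
    meters_per_degree = math.pi * EARTH_RADIUS_M / 180
    north_south = delta_deg * meters_per_degree
    # Meridians converge toward the poles, so a degree of longitude shrinks by cos(latitude)
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

ns, ew = degrees_to_meters(0.002, 47.495)
print("0.002 degrees is about {:.0f} m north-south and {:.0f} m east-west".format(ns, ew))
```

In other words, the neighbourhood used by the clustering is not a circle on the ground but is somewhat narrower in the east-west direction. If that matters for a given use case, one option is to convert the coordinates to meters before clustering.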


### Now we select the ones with the most members ...

In [10]:
unique, counts = np.unique(clust.labels_, return_counts=True)
good_clusters = counts.argsort()[-number_of_reccommendations:][::-1]
good_clusters

array([0, 4, 7, 5, 1])
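
The `argsort` call above returns positions in `counts`; these coincide with the cluster labels only because the labels are the consecutive integers 0, 1, 2, and so on. A variant that works directly on the labels can be sketched with the standard library. The helper name and the sample labels are made up for illustration, and ties may be ordered differently than with `argsort`.

```python
from collections import Counter

def largest_clusters(labels, k):
    """Return the k most frequent labels, largest cluster first."""
    return [label for label, _ in Counter(labels).most_common(k)]

sample_labels = [0, 0, 1, 2, 2, 2, 3]
print(largest_clusters(sample_labels, 2))  # [2, 0]
```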

### ... and we identify the venue locations that belong to the best clusters.

In [11]:
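mask = np.isin(clust.labels_, good_clusters)
X_good = X[mask]
# Reuse the labels of the first fit instead of clustering again: 0 = largest cluster, 1 = second largest, ...
X_good_labels = np.array([list(good_clusters).index(label) for label in clust.labels_[mask]])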

### Finally, we indicate the best clusters on the map.
The black markers indicate the cluster centers, and the colored markers show the already existing venues that belong to the best clusters. The area which is marked by the colored markers is the recommended location for a new venue of the same category.

In [12]:
map = folium.Map(location=[BP_LAT, BP_LNG], zoom_start=13)

# set color scheme for the clusters (one color more than needed; index 0 stays unused)
colors_array = cm.rainbow(np.linspace(0, 1, number_of_reccommendations + 1))
rainbow = [colors.rgb2hex(i) for i in colors_array]

for coord, cluster in zip(X_good, X_good_labels):
    folium.CircleMarker(
        coord,
        radius=5,
        color=rainbow[cluster+1],
        fill=True,
        fill_color=rainbow[cluster+1],
        fill_opacity=0.7,
        parse_html=False).add_to(map)
    
for coord in clust.cluster_centers_[good_clusters]:
    folium.CircleMarker(
        coord,
        radius=4,
        color='#0d0d0c',
        fill=True,
        fill_color='#52524e',
        fill_opacity=0.7,
        parse_html=False).add_to(map)
    
map