# How to choose a location for your new coffee shop

## Introduction/Business Problem 


This document is addressed to entrepreneurs who want to open a coffee shop. Starting such a business involves many difficulties, and choosing the best location for the future coffee shop is one of the most common.

Can machine learning help with this non-trivial task? Can a business owner use it to:

- Get a visual representation of how venues are distributed across the selected city.
- See how venues group into clusters in particularly popular areas.
- Determine the centers of these clusters.
- Count how many coffee shops each cluster already contains.

This document shows that all of this takes only a few steps.

## Data 

In this document we use Foursquare location data. It describes places and venues: their geographical location, category, working hours, full address, and so on. Given a location expressed as geographical coordinates (latitude and longitude), it lets us determine what types of venues exist within a defined radius of that location.

Using the Foursquare API, we can search for specific types of venues or stores around a given location. For that location we can also tell how many venues of each category exist and how other people have reviewed each surrounding venue.

As input for the model, we selected the geographical coordinates of venues in the city of Brest within a radius of 3000 meters from its center. This distance covers the entire historical center and the surrounding areas, which are potentially interesting for placing coffee shops.
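
To make the 3000-meter radius concrete, the following minimal sketch computes the great-circle distance between the city center used in this document and one venue from the data shown further down (Hotel Hermitage). The helper name `haversine_m` and the Earth radius constant are assumptions introduced for illustration; they are not part of the Foursquare API.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed constant)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

center = (52.0975, 23.6877)      # Brest city center used in this document
venue = (52.093937, 23.681345)   # Hotel Hermitage, taken from the venue table
distance = haversine_m(*center, *venue)
print(f"{distance:.0f} m from the center, inside radius: {distance <= 3000}")
```

A check like this can be useful for verifying that the venues returned for a request actually fall inside the requested radius.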

### Let's define our city location

In [92]:
city = 'Brest'
Latitude = 52.0975
Longitude = 23.6877
radius = 3000

In [93]:
import numpy as np # library for vectorized numerical operations
import pandas as pd # library for tabular data analysis

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

#!conda install -c conda-forge geopy --yes # uncomment this line if you haven't completed the Foursquare API lab
#from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe (top-level import, available since pandas 1.0)

# Matplotlib and associated plotting modules
import matplotlib.pyplot as plt # plotting library
# backend for rendering plots within the browser
%matplotlib inline 

import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if you haven't completed the Foursquare API lab
!pip install folium
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


In [94]:
# create a map of Brest using latitude and longitude values
brest_map = folium.Map(location=[Latitude, Longitude], zoom_start=12)

brest_map

### Let's connect the Foursquare account:

In [95]:
CLIENT_ID = '5XZGFWJL4ICVX2YTUO3HOO5OA3FWJGKZRHURFC42ZLSY0I4O' # your Foursquare ID
CLIENT_SECRET = '2RMWAEG3MXFRUBPMYN4ZE5KRBJIVXOHEV5YR5HCUL55SNJTT' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
LIMIT = 1000
print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 5XZGFWJL4ICVX2YTUO3HOO5OA3FWJGKZRHURFC42ZLSY0I4O
CLIENT_SECRET:2RMWAEG3MXFRUBPMYN4ZE5KRBJIVXOHEV5YR5HCUL55SNJTT


### Now we are ready to get the list of venues:

In [96]:
venues_list=[]
# create the API request URL
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
    CLIENT_ID, 
    CLIENT_SECRET, 
    VERSION, 
    Latitude, 
    Longitude, 
    radius, 
    LIMIT)
            
# make the GET request
results = requests.get(url).json()["response"]['groups'][0]['items']
        
# return only relevant information for each nearby venue
venues_list.append([( 
    v['venue']['name'], 
    v['venue']['location']['lat'], 
    v['venue']['location']['lng'],  
    v['venue']['categories'][0]['name']) for v in results])
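
The list comprehension above assumes that every venue has at least one category; `categories[0]` raises an `IndexError` otherwise. A more defensive variant might look like the sketch below. The function name `extract_venues`, the `'Unknown'` fallback, and the sample payload are hypothetical; only the nesting (`venue` with `name`, `location`, and `categories`) follows the code in this document.

```python
def extract_venues(items):
    """Flatten explore-style items into (name, lat, lng, category) tuples."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # fall back to a placeholder when a venue has no category (assumed policy)
        category = categories[0]['name'] if categories else 'Unknown'
        rows.append((venue.get('name'), location.get('lat'), location.get('lng'), category))
    return rows

# hand-written sample that mimics the structure used above; not real API output
sample_items = [
    {'venue': {'name': 'Paragraph',
               'location': {'lat': 52.097556, 'lng': 23.693544},
               'categories': [{'name': 'Coffee Shop'}]}},
    {'venue': {'name': 'Nameless Spot',
               'location': {'lat': 52.09, 'lng': 23.69},
               'categories': []}},
]
rows = extract_venues(sample_items)
print(rows)
```

The resulting list of tuples could be passed to `pd.DataFrame` in the same way as the flattened `venues_list`.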


In [97]:
nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
nearby_venues.columns = ['Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']

### And this is our data

In [98]:
print(nearby_venues.shape)
nearby_venues.head()

(54, 4)


| | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|
| 0 | Hotel Hermitage | 52.093937 | 23.681345 | Hotel |
| 1 | Paragraph | 52.097556 | 23.693544 | Coffee Shop |
| 2 | Улица Советская | 52.091322 | 23.694616 | Road |
| 3 | Кофейный Бар И Магазин «Cafés La Brasileña» | 52.091152 | 23.68487 | Coffee Shop |
| 4 | Times Cafe | 52.094807 | 23.691402 | Café |


Using the parameters above, we obtained information about 54 venues. This is a fairly small dataset, since Foursquare is not very popular in Belarus. Nevertheless, the study can serve as additional input when starting a new business.

In [99]:
nearby_venues_grouped = nearby_venues.groupby('Venue Category', sort=False).count()
nearby_venues_grouped

| Venue Category | Venue | Venue Latitude | Venue Longitude |
|---|---|---|---|
| Hotel | 3 | 3 | 3 |
| Coffee Shop | 8 | 8 | 8 |
| Road | 1 | 1 | 1 |
| Café | 3 | 3 | 3 |
| Garden | 1 | 1 | 1 |
| Bookstore | 1 | 1 | 1 |
| Park | 3 | 3 | 3 |
| Mediterranean Restaurant | 1 | 1 | 1 |
| National Park | 1 | 1 | 1 |
| Hotel Bar | 1 | 1 | 1 |


### Creating a map using latitude and longitude values and visualizing venues

In [100]:
# create map using latitude and longitude values
brest_map = folium.Map(location=[Latitude, Longitude], zoom_start=12)

# add markers to map
for lat, lng, category, venue in zip(nearby_venues['Venue Latitude'], nearby_venues['Venue Longitude'], nearby_venues['Venue Category'], nearby_venues['Venue']):
    label = '{}, {}'.format(category, venue)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(brest_map)  
    
brest_map

### Let's define our features

In [101]:
cluster_dataset = nearby_venues.copy() # work on a copy so that columns added later do not modify nearby_venues
cluster_features = cluster_dataset.drop(['Venue Category', 'Venue'], axis=1)
print(cluster_features.shape)
cluster_features.head()

(54, 2)


| | Venue Latitude | Venue Longitude |
|---|---|---|
| 0 | 52.093937 | 23.681345 |
| 1 | 52.097556 | 23.693544 |
| 2 | 52.091322 | 23.694616 |
| 3 | 52.091152 | 23.68487 |
| 4 | 52.094807 | 23.691402 |


### Building a model

In [102]:
num_clusters = 10 # number of clusters, chosen manually

# k-means++ initialization; the algorithm is run 12 times and the best result is kept
k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init=12)
k_means.fit(cluster_features)
k_means_labels = k_means.labels_
k_means_cluster_centers = k_means.cluster_centers_
print(k_means_labels.shape)
print(k_means_labels)

(54,)
[4 0 8 4 0 0 0 9 0 0 0 8 4 8 0 8 0 8 8 0 4 9 1 8 8 1 8 4 9 0 0 8 4 8 8 0 1
 2 1 1 8 3 1 1 1 2 3 2 6 2 7 7 5 5]


In [103]:
k_means_cluster_centers

array([[52.09546561, 23.69162233],
       [52.08379854, 23.65757312],
       [52.08684467, 23.70750531],
       [52.11549388, 23.6872844 ],
       [52.09160541, 23.68358166],
       [52.0740375 , 23.706963  ],
       [52.11213401, 23.66670504],
       [52.09199177, 23.72532728],
       [52.08958273, 23.69413765],
       [52.0908921 , 23.67407835]])
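
Each label printed above is the index of the cluster center nearest to the venue. The sketch below reproduces that assignment step in plain Python for two venues from the data, using the centers printed above. The helper `nearest_center` is introduced here for illustration only. It measures distance in raw degrees, as `KMeans` does on these features; because a degree of longitude is shorter than a degree of latitude at this latitude, this only approximates distance on the ground.

```python
# cluster centers copied from the k_means_cluster_centers output above
centers = [
    (52.09546561, 23.69162233),
    (52.08379854, 23.65757312),
    (52.08684467, 23.70750531),
    (52.11549388, 23.6872844),
    (52.09160541, 23.68358166),
    (52.0740375, 23.706963),
    (52.11213401, 23.66670504),
    (52.09199177, 23.72532728),
    (52.08958273, 23.69413765),
    (52.0908921, 23.67407835),
]

def nearest_center(lat, lng, centers):
    """Return the index of the center with the smallest squared Euclidean distance."""
    return min(
        range(len(centers)),
        key=lambda i: (lat - centers[i][0]) ** 2 + (lng - centers[i][1]) ** 2,
    )

hermitage = nearest_center(52.093937, 23.681345, centers)  # Hotel Hermitage
paragraph = nearest_center(52.097556, 23.693544, centers)  # Paragraph
print(hermitage, paragraph)
```

For these two venues the result agrees with the first two entries of the label array above (4 and 0).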

In [104]:
centers_df = pd.DataFrame(k_means_cluster_centers)
centers_df.reset_index(inplace = True)
centers_df.columns = ('Labels', 'Center Latitude', 'Center Longitude')
centers_df

| | Labels | Center Latitude | Center Longitude |
|---|---|---|---|
| 0 | 0 | 52.095466 | 23.691622 |
| 1 | 1 | 52.083799 | 23.657573 |
| 2 | 2 | 52.086845 | 23.707505 |
| 3 | 3 | 52.115494 | 23.687284 |
| 4 | 4 | 52.091605 | 23.683582 |
| 5 | 5 | 52.074038 | 23.706963 |
| 6 | 6 | 52.112134 | 23.666705 |
| 7 | 7 | 52.091992 | 23.725327 |
| 8 | 8 | 52.089583 | 23.694138 |
| 9 | 9 | 52.090892 | 23.674078 |


In [105]:
cluster_dataset["Labels"] = k_means_labels
cluster_dataset = cluster_dataset.merge(centers_df, left_on='Labels', right_on='Labels')
cluster_dataset.head()

| | Venue | Venue Latitude | Venue Longitude | Venue Category | Labels | Center Latitude | Center Longitude |
|---|---|---|---|---|---|---|---|
| 0 | Hotel Hermitage | 52.093937 | 23.681345 | Hotel | 4 | 52.091605 | 23.683582 |
| 1 | Кофейный Бар И Магазин «Cafés La Brasileña» | 52.091152 | 23.68487 | Coffee Shop | 4 | 52.091605 | 23.683582 |
| 2 | Hermitage Lounge | 52.093912 | 23.681306 | Hotel Bar | 4 | 52.091605 | 23.683582 |
| 3 | Площадь Ленина | 52.094007 | 23.685053 | Plaza | 4 | 52.091605 | 23.683582 |
| 4 | Бассейн Нептун | 52.087953 | 23.683238 | Pool | 4 | 52.091605 | 23.683582 |


### Visualizing the Resulting Clusters
Now that the KMeans model is fitted and we have the labels and the cluster centers, let's plot them and see what the clusters look like.

In [106]:
# create a map of Brest using latitude and longitude values
brest_map_labeled = folium.Map(location=[Latitude, Longitude], zoom_start=12)
# one color per cluster label; named cluster_colors so it does not shadow the matplotlib.colors import
cluster_colors = ['blue','fuchsia','red','green','yellow','orange','khaki','pink','violet','brown','grey','white','purple','beige','firebrick','coral','azure']

# add venue markers and cluster-center circles to the map
for lat, lng, category, venue, lab, clat, clng in zip(cluster_dataset['Venue Latitude'], cluster_dataset['Venue Longitude'], cluster_dataset['Venue Category'], cluster_dataset['Venue'], cluster_dataset['Labels'], cluster_dataset['Center Latitude'], cluster_dataset['Center Longitude']):
    label = '{}, {}, {}'.format(category, venue, lab)
    label = folium.Popup(label, parse_html=True)
    # small filled marker for the venue, colored by its cluster
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='black',
        fill=True,
        fill_color=cluster_colors[lab],
        fill_opacity=0.9,
        parse_html=False).add_to(brest_map_labeled)  
    # large unfilled circle around the center of the venue's cluster
    folium.CircleMarker(
        [clat, clng],
        radius=20,
        popup=str(lab),
        color=cluster_colors[lab],
        fill=False,
        fill_opacity=0.01,
        parse_html=False).add_to(brest_map_labeled) 
    
brest_map_labeled

### Now we will calculate the total number of venues in each cluster, the number of coffee shops, and the ratio of coffee shops to other venues, which indicates the level of competition

In [107]:
coffeshop = cluster_dataset[cluster_dataset['Venue Category'] == 'Coffee Shop']
coffeshop = coffeshop[['Labels', 'Venue Category']].groupby(['Labels']).count()
coffeshop.rename(columns={'Venue Category': 'Coffee Shop'}, inplace=True)
other = cluster_dataset[cluster_dataset['Venue Category'] != 'Coffee Shop']
other = other[['Labels', 'Venue Category']].groupby(['Labels']).count()
other.rename(columns={'Venue Category': 'Other'}, inplace=True)
coffeshop = coffeshop.merge(other, how = 'outer', on = 'Labels').fillna(0)
coffeshop['Coffee Shop'] = coffeshop['Coffee Shop'].astype(int)
coffeshop['Total'] = (coffeshop['Coffee Shop']+coffeshop['Other']).astype(int)
coffeshop['Ratio'] = round(coffeshop['Coffee Shop']/coffeshop['Other'],2)
coffeshop = coffeshop.sort_values(by = 'Labels', ascending=True)
coffeshop

| Labels | Coffee Shop | Other | Total | Ratio |
|---|---|---|---|---|
| 0 | 3 | 10 | 13 | 0.3 |
| 1 | 0 | 8 | 8 | 0.0 |
| 2 | 0 | 4 | 4 | 0.0 |
| 3 | 0 | 2 | 2 | 0.0 |
| 4 | 1 | 5 | 6 | 0.2 |
| 5 | 0 | 2 | 2 | 0.0 |
| 6 | 0 | 1 | 1 | 0.0 |
| 7 | 0 | 2 | 2 | 0.0 |
| 8 | 4 | 9 | 13 | 0.44 |
| 9 | 0 | 3 | 3 | 0.0 |
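
The `Ratio` column divides the number of coffee shops by the number of other venues, so a cluster consisting only of coffee shops would lead to a division by zero (pandas would report `inf` rather than raise an error). The sketch below shows the same calculation in plain Python with an explicit guard. The function name `competition_ratio` and the choice to return `None` in that case are assumptions made for illustration.

```python
def competition_ratio(coffee_shops, other):
    """Coffee shops per other venue, rounded to 2 decimals; None if there are no other venues."""
    if other == 0:
        return None
    return round(coffee_shops / other, 2)

# (coffee shops, other venues) for three clusters, taken from the table above
clusters = {0: (3, 10), 4: (1, 5), 8: (4, 9)}
ratios = {label: competition_ratio(c, o) for label, (c, o) in clusters.items()}
print(ratios)
```

A lower ratio in a cluster with many venues suggests a busy area with comparatively little direct competition.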


### Analyze Each Cluster


In [108]:
# one hot encoding
cluster_onehot = pd.get_dummies(cluster_dataset[['Venue Category']], prefix="", prefix_sep="")

# add the cluster label column back to the dataframe
cluster_onehot['Labels'] = cluster_dataset['Labels'] 

# move the cluster label column to the first position
fixed_columns = [cluster_onehot.columns[-1]] + list(cluster_onehot.columns[:-1])
cluster_onehot = cluster_onehot[fixed_columns]

cluster_onehot.head()

Unnamed: 0,Labels,Asian Restaurant,Belarusian Restaurant,Bookstore,Bubble Tea Shop,Bus Stop,Café,Church,Coffee Shop,Eastern European Restaurant,Garden,Gourmet Shop,Gym,Gym / Fitness Center,Historic Site,History Museum,Hookah Bar,Hostel,Hotel,Hotel Bar,Italian Restaurant,Mediterranean Restaurant,Modern European Restaurant,Monument / Landmark,Museum,National Park,Other Great Outdoors,Park,Plaza,Pool,Road,Shoe Store,Shopping Mall,Steakhouse
0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,4,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
4,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


Next, let's group the rows by cluster and take the mean of the frequency of occurrence of each category.


In [109]:
cluster_grouped = cluster_onehot.groupby('Labels').mean().reset_index()
cluster_grouped

Unnamed: 0,Labels,Asian Restaurant,Belarusian Restaurant,Bookstore,Bubble Tea Shop,Bus Stop,Café,Church,Coffee Shop,Eastern European Restaurant,Garden,Gourmet Shop,Gym,Gym / Fitness Center,Historic Site,History Museum,Hookah Bar,Hostel,Hotel,Hotel Bar,Italian Restaurant,Mediterranean Restaurant,Modern European Restaurant,Monument / Landmark,Museum,National Park,Other Great Outdoors,Park,Plaza,Pool,Road,Shoe Store,Shopping Mall,Steakhouse
0,0,0.076923,0.0,0.076923,0.0,0.0,0.076923,0.0,0.230769,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.076923,0.076923,0.0,0.0,0.076923,0.076923,0.0,0.0,0.0,0.076923,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.0
1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.125,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.25,0.0,0.0,0.0,0.0,0.25,0.0,0.0
3,3,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.166667,0.166667,0.166667,0.0,0.0,0.0,0.0
5,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0
6,6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
7,7,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,8,0.0,0.0,0.0,0.076923,0.0,0.076923,0.0,0.307692,0.0,0.0,0.0,0.076923,0.076923,0.0,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.076923,0.0,0.0,0.0,0.0,0.0,0.076923,0.0,0.076923,0.0,0.0,0.076923
9,9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0
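
The values in this table are per-cluster category shares: the mean of a one-hot column equals the number of venues of that category divided by the number of venues in the cluster. A minimal plain-Python sketch of the same idea, using a made-up list of categories for one hypothetical cluster (the helper name `category_shares` is also an assumption):

```python
from collections import Counter

def category_shares(categories):
    """Map each category to its share of the venues in one cluster."""
    counts = Counter(categories)
    total = len(categories)
    return {name: count / total for name, count in counts.items()}

# made-up categories for one hypothetical cluster of four venues
sample_cluster = ['Coffee Shop', 'Coffee Shop', 'Park', 'Hotel']
shares = category_shares(sample_cluster)

# most common categories first, similar to the ordering used in the final report
top = sorted(shares.items(), key=lambda kv: kv[1], reverse=True)
print(top)
```

The shares within one cluster always sum to 1, which is a quick sanity check for the grouped table.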


In [110]:
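def cluster_info(cluster):
    # return the one-row summary (Coffee Shop, Other, Total, Ratio) for the given cluster;
    # the positional lookup works because coffeshop is sorted by label and has one row per label
    return coffeshop.reset_index().iloc[cluster:cluster + 1, 1:5]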


In [111]:
print(coffeshop.reset_index().iloc[0:1,1:5])

   Coffee Shop  Other  Total  Ratio
0            3     10     13    0.3


### Let's print the final report
For each cluster we print the top 5 most common venue categories, followed by its coffee shop summary.


In [112]:
num_top_venues = 5
 
for cluster in cluster_grouped['Labels']:
    print("             Cluster",cluster)
    print("************************************")
    temp = cluster_grouped[cluster_grouped['Labels'] == cluster].T.reset_index()
    temp.columns = ['venue','freq']
    temp = temp.iloc[1:]
    temp['freq'] = temp['freq'].astype(float)
    temp = temp.round({'freq': 2})
    print(temp.sort_values('freq', ascending=False).reset_index(drop=True).head(num_top_venues))
    print("------------------------------------")
    print(cluster_info(cluster=cluster))
    print("************************************")
    print('\n')

             Cluster 0
************************************
              venue  freq
0       Coffee Shop  0.23
1  Asian Restaurant  0.08
2            Garden  0.08
3              Park  0.08
4     National Park  0.08
------------------------------------
   Coffee Shop  Other  Total  Ratio
0            3     10     13    0.3
************************************


             Cluster 1
************************************
                 venue  freq
0        Historic Site  0.50
1       History Museum  0.25
2               Church  0.12
3  Monument / Landmark  0.12
4     Asian Restaurant  0.00
------------------------------------
   Coffee Shop  Other  Total  Ratio
1            0      8      8    0.0
************************************


             Cluster 2
************************************
                  venue  freq
0          Gourmet Shop  0.25
1            Shoe Store  0.25
2                  Café  0.25
3  Other Great Outdoors  0.25
4      Asian Restaurant  0.00
--------------