# Final assignment: The Battle of Vegan neighborhoods 

## Week 4 - Description of project to be continued and finalized in week 5
#### A description of the problem and a discussion of the background. (15 marks)

##### Background 
Having lived in Edinburgh for two years, I'm only now learning where to shop for good vegan food. There are several vegan grocery stores within the Edinburgh city limits, such as "New Leaf", "The Refillery" and "Easter Greens". Veganism is quite popular in this area and widely known, so it makes sense that such grocery stores exist. However, the vegan population is still <10%, and as a result the stores that pop up are quite small and lack the marketing budgets of big brands such as Tesco and Sainsbury's. Because they send out no marketing material, they are harder for shoppers to find. 

##### Problem formulation 
A reasonable assumption is that areas with vegan venues attract or convert vegans, so these areas likely have more vegans. My problem formulation is: Can we, based on data about different kinds of venues, cluster the Edinburgh city area to make targeted marketing efforts easier for vegan grocery stores? 


#### A description of the data and how it will be used to solve the problem. (15 marks)
- I'll use data from Foursquare to find venues in the Edinburgh area that match the keyword "vegan" 
- I'll use Google Maps to extract coordinates for the vegan grocery stores I already know of 
- I'll use folium maps to present the geographical data interactively, so that the results are easy to read for the intended audience (a grocery store owner who is not a data scientist)
- I'll use a CSV file from https://www.freemaptools.com/download/outcode-postcodes/postcode-outcodes.csv that contains outer postcodes to divide Edinburgh into suitable marketing areas
- I'll use existing Python libraries to conduct the clustering analysis (k-means) etc.

## Week 5 - Description of project to be continued and finalized in week 5

In [5]:
#Import necessary packages
import pandas as pd
import numpy as np
import requests # library to handle requests
from pandas.io.json import json_normalize # transform JSON file into a pandas dataframe
import matplotlib.cm as cm
import matplotlib.colors as colors
from sklearn.cluster import KMeans

print('Libraries imported.')

Libraries imported.


In [10]:
#Install conda packages, will take several minutes
#!conda install -c conda-forge geopy --yes # uncomment this line if geopy is not installed yet
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if folium is not installed yet
import folium # map rendering library

print('Conda installed.')

Solving environment: done

## Package Plan ##

  environment location: /opt/conda/envs/Python36

  added / updated specs: 
    - folium=0.5.0


The following NEW packages will be INSTALLED:

    altair:  4.1.0-py_1 conda-forge
    branca:  0.4.0-py_0 conda-forge
    folium:  0.5.0-py_0 conda-forge
    vincent: 0.4.4-py_1 conda-forge

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
Conda installed.


In [15]:
url='https://www.freemaptools.com/download/outcode-postcodes/postcode-outcodes.csv'
df_area = pd.read_csv(url)
df_area.head()

Unnamed: 0,id,postcode,latitude,longitude
0,2,AB10,57.13514,-2.11731
1,3,AB11,57.13875,-2.09089
2,4,AB12,57.101,-2.1106
3,5,AB13,57.10801,-2.23776
4,6,AB14,57.10076,-2.27073


In [16]:
df_area.drop('id', axis=1, inplace=True) #drop id column as not needed

In [61]:
df_area=df_area[df_area['postcode'].str.startswith("EH")] #EH outcodes cover Edinburgh and the surrounding area
print("Number of postcodes to analyze (#of rows)", df_area.shape)
df_area

Number of postcodes to analyze (#of rows) (57, 3)


Unnamed: 0,postcode,latitude,longitude
776,EH1,55.95243,-3.1884
777,EH10,55.92077,-3.20984
778,EH11,55.93387,-3.24867
779,EH12,55.94262,-3.27137
780,EH13,55.90788,-3.24144
781,EH14,55.90925,-3.28308
782,EH15,55.94686,-3.11136
783,EH16,55.92221,-3.15387
784,EH17,55.90704,-3.14222
785,EH18,55.87667,-3.12215
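
Outcodes in the EH area all begin with "EH", so the filter can be anchored to the start of the code rather than matching the letters anywhere. The sketch below shows the difference on a small made-up sample; the `sample` frame and the code "XEH9" are illustrative assumptions, not rows from the real file.

```python
import pandas as pd

# Illustrative sample only: AB10 and EH1 are copied from the tables above,
# "XEH9" is a made-up code that shows where the two filters disagree
sample = pd.DataFrame({
    "postcode": ["AB10", "EH1", "XEH9"],
    "latitude": [57.13514, 55.95243, 0.0],
    "longitude": [-2.11731, -3.1884, 0.0],
})

# Matches "EH" anywhere in the code
contains_eh = sample[sample["postcode"].str.contains("EH")]

# Matches "EH" only at the start of the code
starts_with_eh = sample[sample["postcode"].str.startswith("EH")]

print(list(contains_eh["postcode"]))
print(list(starts_with_eh["postcode"]))
```

On the real outcode file both filters are expected to select the same rows; the anchored version simply states the intent more precisely.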


In [44]:
#Get Edinburgh coordinates
address = 'Edinburgh, UK'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

print('The geographical coordinates of Edinburgh are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Edinburgh are 55.9533456, -3.1883749.


In [56]:
#adding the coordinates of three known grocery stores in Edinburgh that could be interested in this information 
vegan_g=pd.DataFrame([('New leaf', 55.938676, -3.191008),('Easter Greens', 55.957916, -3.171352),('The Refillery', 55.938479, -3.178054)], columns = ['Shop','latitude','longitude'])
vegan_g

Unnamed: 0,Shop,latitude,longitude
0,New leaf,55.938676,-3.191008
1,Easter Greens,55.957916,-3.171352
2,The Refillery,55.938479,-3.178054
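
To relate the shops to the postcode areas, one option is to compute the distance from each shop to the outcode centroids. The sketch below is a minimal illustration using the haversine formula; the helper names `haversine_km` and `nearest_outcode` are hypothetical, and the centroid list holds only the outcodes printed in this notebook, not all 57.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0  # mean Earth radius, an approximation
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

# Shop coordinates copied from the vegan_g table above
shops = {
    "New leaf": (55.938676, -3.191008),
    "Easter Greens": (55.957916, -3.171352),
    "The Refillery": (55.938479, -3.178054),
}

# A subset of outcode centroids copied from the df_area output above
centroids = {
    "EH1": (55.95243, -3.1884),
    "EH10": (55.92077, -3.20984),
    "EH15": (55.94686, -3.11136),
    "EH16": (55.92221, -3.15387),
}

def nearest_outcode(lat, lon):
    """Return the listed outcode whose centroid is closest to the point."""
    return min(centroids, key=lambda code: haversine_km(lat, lon, *centroids[code]))

for shop, (lat, lon) in shops.items():
    code = nearest_outcode(lat, lon)
    distance = haversine_km(lat, lon, *centroids[code])
    print(shop, code, round(distance, 2), "km")
```

A centroid is only a rough stand-in for a postcode area, so the nearest centroid is not guaranteed to be the outcode the shop actually belongs to.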


In [151]:
#create map of Edinburgh and plot areas of interest for potential marketing
map_edinburgh = folium.Map(location=[latitude, longitude], zoom_start=13) #zoom 13 gives the more central Edinburgh area where advertising is likely more interesting

#add markers to map
for lat, lng, label in zip(df_area['latitude'], df_area['longitude'], df_area['postcode']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_edinburgh)  
    
for lat, lng, label in zip(vegan_g['latitude'], vegan_g['longitude'], vegan_g['Shop']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=7,
        popup=label,
        color='black',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_edinburgh) 
    
map_edinburgh

In [58]:
#hidden personal Foursquare ID
CLIENT_ID = '####' # your Foursquare ID
CLIENT_SECRET = '######' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version used from before kept to not deal with updating

In [100]:
#define a function to get nearby venues
def getNearbyVenues(names, latitudes, longitudes, radius=300, query='vegan', LIMIT=100):
    #query: a single keyword; an '&' inside the value would be read as a parameter separator if formatted straight into the URL
    #LIMIT: set to one hundred as this is a Sandbox account
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request; requests URL-encodes the parameter values
        url = 'https://api.foursquare.com/v2/venues/explore'
        params = {
            'client_id': CLIENT_ID,
            'client_secret': CLIENT_SECRET,
            'v': VERSION,
            'll': '{},{}'.format(lat, lng),
            'radius': radius,
            'query': query,
            'limit': LIMIT}
            
        # make the GET request
        results = requests.get(url, params=params).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Postcode', 
                  'Latitude', 
                  'Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
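
A note on building the request URL: when a value containing `&` is formatted straight into a URL, standard URL parsing treats the `&` as a separator between parameters. The sketch below shows this with the standard library only; it makes no request, and how the Foursquare server itself handles such a URL is not verified here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base = "https://api.foursquare.com/v2/venues/explore"
query = "vegan&vegetarian"

# Plain string formatting: the '&' inside the value ends the parameter early
naive_url = "{}?query={}&limit={}".format(base, query, 100)
naive_params = parse_qs(urlsplit(naive_url).query, keep_blank_values=True)

# urlencode escapes the '&' as %26, so the whole string stays in one parameter
safe_url = "{}?{}".format(base, urlencode({"query": query, "limit": 100}))
safe_params = parse_qs(urlsplit(safe_url).query, keep_blank_values=True)

print(naive_params)  # 'query' holds only 'vegan'; 'vegetarian' becomes its own empty parameter
print(safe_params)   # 'query' holds the full string 'vegan&vegetarian'
```

Passing a dictionary through the `params` argument of `requests.get` applies the same kind of encoding automatically.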

In [101]:
edi_venues = getNearbyVenues(names=df_area['postcode'],
                                   latitudes=df_area['latitude'],
                                   longitudes=df_area['longitude']
                                  )

EH1
EH10
EH11
EH12
EH13
EH14
EH15
EH16
EH17
EH18
EH19
EH2
EH20
EH21
EH22
EH23
EH24
EH25
EH26
EH27
EH28
EH29
EH3
EH30
EH31
EH32
EH33
EH34
EH35
EH36
EH37
EH38
EH39
EH4
EH40
EH41
EH42
EH43
EH44
EH45
EH46
EH47
EH48
EH49
EH5
EH51
EH52
EH53
EH54
EH55
EH6
EH7
EH8
EH9
EH91
EH95
EH99
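
Each request in the loop returns one JSON document, and the function reads the venue name, location and first category from it. The sketch below replays that extraction offline on a hypothetical, trimmed-down response (`mock_response` is made up and contains only the keys the function reads) and adds guards for missing groups or categories, which the original code assumes are always present. The helper name `extract_rows` is an assumption for illustration.

```python
# Hypothetical response with the same nesting that getNearbyVenues reads
mock_response = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Example Vegan Cafe",
                   "location": {"lat": 55.95, "lng": -3.19},
                   "categories": [{"name": "Café"}]}},
        {"venue": {"name": "Example Uncategorised Venue",
                   "location": {"lat": 55.96, "lng": -3.18},
                   "categories": []}},
    ]}]}
}

def extract_rows(postcode, lat, lng, response):
    """Turn one explore response into a list of row tuples."""
    groups = response.get("response", {}).get("groups", [])
    items = groups[0]["items"] if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        # Fall back to None when a venue has no category
        category = categories[0]["name"] if categories else None
        rows.append((postcode, lat, lng, venue["name"],
                     venue["location"]["lat"], venue["location"]["lng"], category))
    return rows

rows = extract_rows("EH1", 55.95243, -3.1884, mock_response)
for row in rows:
    print(row)
```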


In [102]:
edi_venues.groupby('Postcode').count()

Postcode,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
EH1,5,5,5,5,5,5
EH2,6,6,6,6,6,6
EH3,7,7,7,7,7,7
EH6,1,1,1,1,1,1
EH99,1,1,1,1,1,1


In [103]:
edi_venues.head()

Unnamed: 0,Postcode,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,EH1,55.95243,-3.1884,Baked Potato Shop,55.950336,-3.188358,Vegetarian / Vegan Restaurant
1,EH1,55.95243,-3.1884,The Edinburgh Larder,55.95008,-3.186088,Café
2,EH1,55.95243,-3.1884,Dishoom,55.953726,-3.19254,Indian Restaurant
3,EH1,55.95243,-3.1884,Byron,55.950364,-3.187345,Burger Joint
4,EH1,55.95243,-3.1884,PizzaExpress,55.95066,-3.187444,Pizza Place


In [106]:
# one hot encoding
edi_onehot = pd.get_dummies(edi_venues[['Venue Category']], prefix="", prefix_sep="")
# add postcode column back to dataframe
edi_onehot['postcode'] = edi_venues['Postcode'] 

# move postcode column to the first column
fixed_columns = [edi_onehot.columns[-1]] + list(edi_onehot.columns[:-1])
edi_onehot = edi_onehot[fixed_columns]


edi_onehot.head()

Unnamed: 0,postcode,American Restaurant,Asian Restaurant,Burger Joint,Café,Indian Restaurant,Mediterranean Restaurant,Pizza Place,Vegetarian / Vegan Restaurant
0,EH1,0,0,0,0,0,0,0,1
1,EH1,0,0,0,1,0,0,0,0
2,EH1,0,0,0,0,1,0,0,0
3,EH1,0,0,1,0,0,0,0,0
4,EH1,0,0,0,0,0,0,1,0


In [139]:
edi_grouped = edi_onehot.groupby('postcode').mean().reset_index()
edi_grouped

Unnamed: 0,postcode,American Restaurant,Asian Restaurant,Burger Joint,Café,Indian Restaurant,Mediterranean Restaurant,Pizza Place,Vegetarian / Vegan Restaurant
0,EH1,0.0,0.0,0.2,0.2,0.2,0.0,0.2,0.2
1,EH2,0.166667,0.0,0.0,0.166667,0.166667,0.166667,0.0,0.333333
2,EH3,0.142857,0.142857,0.0,0.142857,0.0,0.285714,0.0,0.285714
3,EH6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,EH99,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
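
The two steps above (one-hot encoding, then averaging per postcode) turn a list of venues into a table of category shares. The sketch below repeats them on six rows copied from this notebook, the five EH1 venues and Harmonium in EH6, so the result can be compared with the table above. The variable names are illustrative.

```python
import pandas as pd

# Rows copied from the edi_venues output: five EH1 venues and one EH6 venue
venues = pd.DataFrame({
    "Postcode": ["EH1", "EH1", "EH1", "EH1", "EH1", "EH6"],
    "Venue Category": [
        "Vegetarian / Vegan Restaurant", "Café", "Indian Restaurant",
        "Burger Joint", "Pizza Place", "Vegetarian / Vegan Restaurant",
    ],
})

# One column per category, 1.0 where the venue belongs to it
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "postcode", venues["Postcode"])

# The mean of the 0/1 columns is the share of each category within a postcode
shares = onehot.groupby("postcode").mean()
print(shares)
```

Every row of `shares` sums to 1, so a postcode with a single venue always shows a share of 1.0 for that venue's category.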


In [119]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [124]:
import numpy as np

num_top_venues = 6

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['postcode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
postcode_venues_sorted = pd.DataFrame(columns=columns)
postcode_venues_sorted['postcode'] = edi_grouped['postcode']

for ind in np.arange(edi_grouped.shape[0]):
    postcode_venues_sorted.iloc[ind, 1:] = return_most_common_venues(edi_grouped.iloc[ind, :], num_top_venues)

postcode_venues_sorted

Unnamed: 0,postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue
0,EH1,Vegetarian / Vegan Restaurant,Pizza Place,Indian Restaurant,Café,Burger Joint,Mediterranean Restaurant
1,EH2,Vegetarian / Vegan Restaurant,Mediterranean Restaurant,Indian Restaurant,Café,American Restaurant,Pizza Place
2,EH3,Vegetarian / Vegan Restaurant,Mediterranean Restaurant,Café,Asian Restaurant,American Restaurant,Pizza Place
3,EH6,Vegetarian / Vegan Restaurant,Pizza Place,Mediterranean Restaurant,Indian Restaurant,Café,Burger Joint
4,EH99,Café,Vegetarian / Vegan Restaurant,Pizza Place,Mediterranean Restaurant,Indian Restaurant,Burger Joint
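
In the table above, postcodes with few venues still get six ranked categories, because categories with a frequency of zero are ranked as well (EH6 has only one venue). A possible variant, sketched below with a hypothetical helper `top_present_categories`, lists only the categories that actually occur in a postcode.

```python
import pandas as pd

def top_present_categories(row, num_top_venues):
    """Return up to num_top_venues category names with a frequency above zero."""
    present = row[row > 0].sort_values(ascending=False)
    return list(present.index[:num_top_venues])

# Frequencies for EH1 and EH6 copied from the edi_grouped output above
freq = pd.DataFrame(
    {
        "American Restaurant": [0.0, 0.0],
        "Asian Restaurant": [0.0, 0.0],
        "Burger Joint": [0.2, 0.0],
        "Café": [0.2, 0.0],
        "Indian Restaurant": [0.2, 0.0],
        "Mediterranean Restaurant": [0.0, 0.0],
        "Pizza Place": [0.2, 0.0],
        "Vegetarian / Vegan Restaurant": [0.2, 1.0],
    },
    index=["EH1", "EH6"],
)

for postcode in freq.index:
    print(postcode, top_present_categories(freq.loc[postcode], 6))
```

Categories with equal frequencies, such as the five in EH1, have no meaningful order among themselves, so their ranks should not be read as a preference.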


In [148]:
# set number of clusters
kclusters = 3 #enough given sample size
edi_grouped_clustering = edi_grouped.drop('postcode', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(edi_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10] 


array([2, 2, 2, 1, 0], dtype=int32)
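
With only five postcodes, the choice of three clusters can be checked by looking at how the within-cluster sum of squares (inertia) falls as k grows. The sketch below is a rough check, not the analysis run above: it rebuilds the frequency table by hand from the `edi_grouped` output and sets `n_init=10` explicitly, which is an assumption about a sensible setting. The numbering of the labels is arbitrary and may differ from the array above; only the grouping matters.

```python
import numpy as np
from sklearn.cluster import KMeans

postcodes = ["EH1", "EH2", "EH3", "EH6", "EH99"]

# Category shares copied from the edi_grouped output above
X = np.array([
    [0.0, 0.0, 0.2, 0.2, 0.2, 0.0, 0.2, 0.2],
    [0.166667, 0.0, 0.0, 0.166667, 0.166667, 0.166667, 0.0, 0.333333],
    [0.142857, 0.142857, 0.0, 0.142857, 0.0, 0.285714, 0.0, 0.285714],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
])

# Inertia for k = 1..4 (k must stay below the number of rows to be informative)
inertias = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).labels_

for k, value in inertias.items():
    print(k, round(value, 3))
print(dict(zip(postcodes, labels)))
```

Because EH6 and EH99 each rest on a single venue, they sit far from every other row and end up in clusters of their own.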

In [149]:
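# drop every venue whose name appears more than once (keep=False removes all copies)
edi_venues.drop_duplicates(subset ="Venue", keep = False, inplace = True) 
print(edi_venues.shape)
print(kmeans.labels_.shape)
# the labels hold one entry per postcode, not per venue, so they cannot be inserted into edi_venues directly
#edi_venues.insert(0, 'Cluster Labels', kmeans.labels_)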

(8, 7)
(5,)


In [150]:
edi_merged = postcode_venues_sorted

# add the cluster labels, then merge with df_area to add latitude/longitude for each postcode
edi_merged['Classification'] = kmeans.labels_
edi_merged=edi_merged.merge(df_area, on='postcode')
edi_merged

Unnamed: 0,postcode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,Classification,latitude,longitude
0,EH1,Vegetarian / Vegan Restaurant,Pizza Place,Indian Restaurant,Café,Burger Joint,Mediterranean Restaurant,2,55.95243,-3.1884
1,EH2,Vegetarian / Vegan Restaurant,Mediterranean Restaurant,Indian Restaurant,Café,American Restaurant,Pizza Place,2,55.95417,-3.19486
2,EH3,Vegetarian / Vegan Restaurant,Mediterranean Restaurant,Café,Asian Restaurant,American Restaurant,Pizza Place,2,55.95412,-3.19967
3,EH6,Vegetarian / Vegan Restaurant,Pizza Place,Mediterranean Restaurant,Indian Restaurant,Café,Burger Joint,1,55.97144,-3.17456
4,EH99,Café,Vegetarian / Vegan Restaurant,Pizza Place,Mediterranean Restaurant,Indian Restaurant,Burger Joint,0,55.955591,-3.17611


In [184]:
edi_venues

Unnamed: 0,Postcode,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,EH1,55.95243,-3.1884,Baked Potato Shop,55.950336,-3.188358,Vegetarian / Vegan Restaurant
1,EH1,55.95243,-3.1884,The Edinburgh Larder,55.95008,-3.186088,Café
3,EH1,55.95243,-3.1884,Byron,55.950364,-3.187345,Burger Joint
4,EH1,55.95243,-3.1884,PizzaExpress,55.95066,-3.187444,Pizza Place
15,EH3,55.95412,-3.19967,Meze Meze,55.952483,-3.19946,Mediterranean Restaurant
16,EH3,55.95412,-3.19967,Wee Buddha,55.955999,-3.202771,Asian Restaurant
18,EH6,55.97144,-3.17456,Harmonium,55.973707,-3.173006,Vegetarian / Vegan Restaurant
19,EH99,55.955591,-3.17611,Century General Store & Cafe,55.956678,-3.171944,Café


In [198]:
# create map with areas, venues and shops
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=13)

# set color scheme for the clusters (one color per cluster)
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lat, lon, poi, cluster in zip(edi_merged['latitude'], edi_merged['longitude'], edi_merged['postcode'], edi_merged['Classification']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=7,
        popup=label,
        color='black',
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)

for lat, lng, label in zip(vegan_g['latitude'], vegan_g['longitude'], vegan_g['Shop']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=15,
        popup=label,
        color='black',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_clusters)
    
map_clusters

End of code 

#### We can see that marketing efforts should be directed towards EH1, EH2 and EH3, where we actually have a significant number of tagged vegan restaurants
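
As a rough cross-check of this conclusion, the venue counts per postcode can be combined with the vegan share from the frequency table to estimate how many tagged vegan venues each area has. The sketch below copies both sets of numbers from the outputs above; the dictionary names are illustrative.

```python
# Venues returned per postcode (from the groupby count output above)
venue_counts = {"EH1": 5, "EH2": 6, "EH3": 7, "EH6": 1, "EH99": 1}

# Share of "Vegetarian / Vegan Restaurant" per postcode (from the edi_grouped output above)
vegan_share = {"EH1": 0.2, "EH2": 0.333333, "EH3": 0.285714, "EH6": 1.0, "EH99": 0.0}

# Estimated number of tagged vegan venues: total venues times vegan share, rounded
vegan_counts = {code: round(venue_counts[code] * vegan_share[code]) for code in venue_counts}

# Rank postcodes by the estimated count, highest first
ranked = sorted(vegan_counts.items(), key=lambda item: item[1], reverse=True)
for code, count in ranked:
    print(code, count, "of", venue_counts[code], "venues")
```

The share alone can mislead: EH6 shows a vegan share of 1.0, but that rests on a single venue, whereas EH1, EH2 and EH3 combine several venues with a vegan-tagged restaurant among them.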