# Capstone Project - The Battle of the Neighborhoods
### Applied Data Science Capstone by IBM/Coursera

## Introduction

Suppose you want to open a new gym, where should it be located? In this project we will go through the process of finding a location that is likely to be the optimal place to start.

In this scenario we want to open a new gym in one of the 20 biggest cities in The Netherlands. Ideally we would want a neighbourhood that is not already saturated with existing gyms, but is similar to other neighbourhoods with succesful gyms.


We will use publicly available data about the dutch cities and their neighbourhoods so we can employ various forms of data analytics and visualizations to help us in our search.

## Data

The metrics we have decided to use to find out new location are:
* Population density per existing gym in a city
* Amount of pre-existing gyms per neighbourhood
* Similarity of neighbourhood to other neighbourhoods with gyms

Following data sources will be needed to extract/generate the required information:
* Population statistics will be obtained from wikipedia
* Name and location of neighbourhoods in each city will be obtained by a combination of Google Maps API geocoding and the Postal code information per city
* Number of gyms and their location in every neighborhood will be obtained using the Foursquare API


In [1]:
import pandas as pd
import numpy as np

import requests
import bs4

In [2]:
response = requests.get('https://en.wikipedia.org/wiki/Template:Largest_cities_of_the_Netherlands')
html = bs4.BeautifulSoup(response.text, 'html.parser')
table = html.find('table',{'class':'navbox'})

We get our list of cities from the wikipedia page for the 20 largest cities of the netherlands. The table is only 10 rows because the data is spread over double the columns. We'll need to do some cleaning to get a nice dataset with no duplicate columns and exactly 20 rows.

In [3]:
#scrape the wikipedia table for its entries and combine those in the dataframe.
city_data = []
cycle_i = 0
cycle_j = 0
for i in table.findAll('tr'):
    row_data=[]
    if cycle_i <2: #skip some empty rows
        cycle_i += 1
        continue
    for j in i.findAll('td'):
        if cycle_i == 2 and (cycle_j == 0 or cycle_j == 9): #some image subscript is in the table that we can skip as well
            cycle_j += 1
            continue
        row_data.append(j.text.strip())
        cycle_j += 1
    cycle_i += 1
    city_data.append(row_data)
df =pd.DataFrame(city_data)

In [4]:
df1 = df.iloc[:,0:4].reset_index(drop=True).dropna()
df2 = df.iloc[:,4::].reset_index(drop=True).dropna()

In [5]:
df1.columns=['0','1','2','3']
df2.columns=['0','1','2','3']


Here we finally have a nice table. The Hague is renamed to "'s-Gravenhage". The former is the english name and the later is the dutch name is used the most in online databases.

In [70]:
df = df1.append(df2,sort=False)
df.columns = ['index','City','Region','Population']
df['City'].replace('The Hague',"'s-Gravenhage", inplace=True) 


In [7]:
from geopy.geocoders import Nominatim
geo_locator = Nominatim(user_agent ='Nederland', timeout=2)

In [8]:
import requests
import bs4

We use the Geonames API to retrieve all the postal codes registered to each city. Here we can also find the latitudes and longitudes for each location. 

In [9]:
postalcodes = []
latitudes = []
longitudes = []
places = []
municitality = []

for i in df['City']:
    response = requests.get('http://api.geonames.org/postalCodeSearch?username=willemw&maxrows=200&country=NL&isReduced=True&style=long&placename=' + i)
    html = bs4.BeautifulSoup(response.text, 'html.parser')
    codes = html.find_all('postalcode')
    lats = html.find_all('lat')
    longs = html.find_all('lng')
    municipalities = html.find_all('adminname2')
    
    for j in range(len(codes)):
        postalcodes.append(codes[j].text)
        latitudes.append(lats[j].text)
        longitudes.append(longs[j].text)
        municitality.append(municipalities[j].text)

        places.append(i)
    



In [10]:
locations = pd.DataFrame({'Postal Code':postalcodes, 
                          'Latitude':latitudes,
                          'Longitude':longitudes,
                          'Place' :places,
                          'Municipality' : municitality})


Now we have a neat dataframe with our location data. We use the municipality data to distinguish between locations that are in Utrecht or Groningen the city and Utrecht or Groningen the province, as we only want the former.


In [11]:
locations = locations[locations['Municipality'] == locations['Place']]

In [12]:
locations.groupby('Place').count()['Postal Code'].sort_values(ascending=False)

Place
Amsterdam           81
Rotterdam           79
's-Gravenhage       61
Utrecht             46
Almere              43
Groningen           43
Eindhoven           34
Haarlemmermeer      33
Apeldoorn           33
's-Hertogenbosch    30
Tilburg             28
Arnhem              26
Zwolle              25
Breda               25
Nijmegen            25
Zaanstad            23
Enschede            23
Haarlem             20
Amersfoort          18
Leiden              16
Name: Postal Code, dtype: int64

In [13]:
import folium

Let's create a map of the netherlands that shows all the locations we've gathered so far. We use a marker radius of 500 meters, the same we'll use in our foursquare search. As you can see there is some overlap, which we'll take care of later, and some empty spots.

In [14]:
# create map of The Netherlands
map_netherlands = folium.Map(location=[52.40451, 4.89127], zoom_start=10)

# add location markers to map
for lat, lng, postalcode, municipality in zip(locations['Latitude'], locations['Longitude'], locations['Postal Code'], locations['Municipality']):
        label = '{},{}'.format(postalcode, municipality)
        label = folium.Popup(label, parse_html=True)
        folium.Circle(
        [lat, lng],
        popup=label,
        radius=500,
        fill=True,
        fill_opacity=0.7,
        parse_html=False).add_to(map_netherlands)  
    
map_netherlands

Our foursquare credentials

In [15]:
CLIENT_ID = 'EUNARJ2QDR1ATEPFQZ0BJ1TDIWWAML1MUWAV4ST2CKBTJWY4' # your Foursquare ID
CLIENT_SECRET = 'HXS1T3XMFHZV3ZDZD5MO1EAGJR1MHDXYTL3DKMM3Q4IMNC1W' # your Foursquare Secret
VERSION = '20200605' # Foursquare API version

Script to search foursquare for all venues near our selected locations

In [16]:
def getNearbyVenues(postalcode,city, latitudes, longitudes, radius=500):
    i = float(1)
    venues_list=[]
    for postalcode, city, lat, lng in zip(postalcode, city, latitudes, longitudes):
        print( i/(len(locations)/100) ) #show progress in output
        i +=1    
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        try:
            results = requests.get(url).json()["response"]['groups'][0]['items']
        
        
            # return only relevant information for each nearby venue
            venues_list.append([(
                city,
                postalcode, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])

        except:
            continue    
        
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City',
                'Postal Code', 
                'Neighborhood Latitude', 
                'Neighborhood Longitude', 
                'Venue', 
                'Venue Latitude', 
                'Venue Longitude', 
                'Venue Category']
    return(nearby_venues)

In [17]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
neighbourhood_venues = getNearbyVenues(postalcode=locations['Postal Code'],
                                   city=locations['Place'],
                                   latitudes=locations['Latitude'],
                                   longitudes=locations['Longitude']
                                  )

0.1404494382022472
0.2808988764044944
0.42134831460674155
0.5617977528089888
0.7022471910112359
0.8426966292134831
0.9831460674157303
1.1235955056179776
1.2640449438202248
1.4044943820224718
1.544943820224719
1.6853932584269662
1.8258426966292134
1.9662921348314606
2.106741573033708
2.247191011235955
2.3876404494382024
2.5280898876404496
2.6685393258426964
2.8089887640449436
2.9494382022471908
3.089887640449438
3.230337078651685
3.3707865168539324
3.5112359550561796
3.651685393258427
3.792134831460674
3.932584269662921
4.073033707865169
4.213483146067416
4.353932584269663
4.49438202247191
4.634831460674158
4.775280898876405
4.915730337078652
5.056179775280899
5.1966292134831455
5.337078651685393
5.47752808988764
5.617977528089887
5.758426966292134
5.8988764044943816
6.039325842696629
6.179775280898876
6.320224719101123
6.46067415730337
6.601123595505618
6.741573033707865
6.882022471910112
7.022471910112359
7.162921348314606
7.303370786516854
7.443820224719101
7.584269662921348
7.724719

Let's take a quick look at the venues we've gathered. From a quick scan it looks like the bigger cities have the most venues so that looks right.

In [18]:
neighbourhood_venues.groupby('City').size().sort_values(ascending=False)

City
Amsterdam           2454
Rotterdam           1401
's-Gravenhage       1146
Utrecht              728
Groningen            478
Eindhoven            424
Haarlem              407
Leiden               327
Breda                287
Almere               287
Arnhem               282
Nijmegen             274
Zwolle               258
's-Hertogenbosch     257
Enschede             250
Apeldoorn            237
Haarlemmermeer       236
Tilburg              218
Amersfoort           202
Zaanstad             199
dtype: int64

Here we query foursquare again, this time we only want the information for gyms and fitness centre (determined by category ID)

In [19]:
def getNearbyGyms(postalcode,city, latitudes, longitudes, radius=500):
    i = float(1)
    venues_list=[]
    for postalcode, city, lat, lng in zip(postalcode, city, latitudes, longitudes):
        print( i/(len(locations)/100) )
        i +=1    
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&categoryId=4bf58dd8d48988d175941735&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        try:
            results = requests.get(url).json()["response"]['groups'][0]['items']
        
        
            # return only relevant information for each nearby venue
            venues_list.append([(
                city,
                postalcode, 
                lat, 
                lng, 
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])

        except:
            continue    
        
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City',
                'Postal Code', 
                'Neighborhood Latitude', 
                'Neighborhood Longitude', 
                'Venue', 
                'Venue Latitude', 
                'Venue Longitude', 
                'Venue Category']
    return(nearby_venues)

In [20]:
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius
neighbourhood_gyms = getNearbyGyms(postalcode=locations['Postal Code'],
                                   city=locations['Place'],
                                   latitudes=locations['Latitude'],
                                   longitudes=locations['Longitude']
                                  )

0.1404494382022472
0.2808988764044944
0.42134831460674155
0.5617977528089888
0.7022471910112359
0.8426966292134831
0.9831460674157303
1.1235955056179776
1.2640449438202248
1.4044943820224718
1.544943820224719
1.6853932584269662
1.8258426966292134
1.9662921348314606
2.106741573033708
2.247191011235955
2.3876404494382024
2.5280898876404496
2.6685393258426964
2.8089887640449436
2.9494382022471908
3.089887640449438
3.230337078651685
3.3707865168539324
3.5112359550561796
3.651685393258427
3.792134831460674
3.932584269662921
4.073033707865169
4.213483146067416
4.353932584269663
4.49438202247191
4.634831460674158
4.775280898876405
4.915730337078652
5.056179775280899
5.1966292134831455
5.337078651685393
5.47752808988764
5.617977528089887
5.758426966292134
5.8988764044943816
6.039325842696629
6.179775280898876
6.320224719101123
6.46067415730337
6.601123595505618
6.741573033707865
6.882022471910112
7.022471910112359
7.162921348314606
7.303370786516854
7.443820224719101
7.584269662921348
7.724719

In [21]:
neighbourhood_gyms.set_index('City', inplace= True)

Use the gym location data to create new markers on our map. So we can see what our distribution is like. As expected the larger cities seem to have more sport facilities.

In [22]:
# add markers for each gym to the map
for lat, lng, postalcode, city in zip(neighbourhood_gyms['Venue Latitude'], neighbourhood_gyms['Venue Longitude'], neighbourhood_gyms['Postal Code'], neighbourhood_gyms.index):
        label = '{},{}'.format(postalcode, city)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
        [lat, lng],
        popup=label,
        color='red',
        radius=3,
        fill=True,
        fill_opacity=0.7,
        parse_html=False).add_to(map_netherlands)  
    
map_netherlands

Let's see what our gym distribution is like. When we queried foursquare for data, some venues might have been found from different postal codes. To get an accurate total count per city we need to remove these duplicates. Now we can calculate the total amount of gyms per city, and the total population per gym density of each city (pop/#gyms)

In [23]:
total_gyms = neighbourhood_gyms.drop_duplicates(['Postal Code','Venue Latitude','Venue Longitude','Venue']).groupby('City')['Venue'].count() #total gyms without the duplicates that are registered to different postal codes

In [71]:
df = df.join(total_gyms, on = 'City')
df.columns = ['index','City','Region','Population', 'Total Gyms']

df['gym density'] = df['Population'].str.replace(",","").astype(int)/df['Total Gyms'] #calculate the population/gym density for each city

df[['City','Population','Total Gyms','gym density']]

Unnamed: 0,City,Population,Total Gyms,gym density
0,Amsterdam,872680,294,2968.29932
1,Rotterdam,650711,164,3967.75
2,'s-Gravenhage,544766,158,3447.886076
3,Utrecht,357179,134,2665.514925
4,Eindhoven,234235,56,4182.767857
5,Groningen,232826,62,3755.258065
6,Tilburg,219632,57,3853.192982
7,Almere,211514,55,3845.709091
8,Breda,184403,45,4097.844444
9,Nijmegen,177818,45,3951.511111


In [25]:
top_cities = df.sort_values(by = 'gym density', ascending = False).head(3) #Make a selection of the 3 cities with the highest population per gym

We want to start making clusters based on the top 10 most common venues per area. This way we can try to find similarities between neighbourhoods with gyms, and locations that are not saturated yet.

In [27]:
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

In [28]:
# one hot encoding
neighbourhood_onehot = pd.get_dummies(neighbourhood_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
neighbourhood_onehot['Postal Code'] = neighbourhood_venues['Postal Code'] 
neighbourhood_onehot['City'] = neighbourhood_venues['City']

# move neighborhood column to the first column
fixed_columns = [neighbourhood_onehot.columns[-2]] + list(neighbourhood_onehot.columns[:-2])
neighbourhood_onehot = neighbourhood_onehot[fixed_columns]

In [43]:
neighbourhood_grouped = neighbourhood_onehot.groupby('Postal Code').sum().reset_index() 

In [44]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Postal Code']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Postal Code'] = neighbourhood_grouped['Postal Code']

for ind in np.arange(neighbourhood_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(neighbourhood_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()

Unnamed: 0,Postal Code,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1011,Bar,Cocktail Bar,Hostel,History Museum,South American Restaurant,Mediterranean Restaurant,Deli / Bodega,Hotel,Bed & Breakfast,Café
1,1012,Bar,Hotel,Coffee Shop,Café,Museum,Argentinian Restaurant,Cocktail Bar,Dessert Shop,Marijuana Dispensary,Gay Bar
2,1013,Restaurant,Music Venue,Café,Pizza Place,Plaza,Coffee Shop,Snack Place,Department Store,French Restaurant,Tapas Restaurant
3,1014,Nightclub,Farm,Museum,Playground,Grocery Store,Seafood Restaurant,Gas Station,Art Gallery,Restaurant,Pizza Place
4,1015,Bar,Italian Restaurant,Café,Thai Restaurant,Sandwich Place,Hotel,Museum,Bistro,Marijuana Dispensary,Seafood Restaurant


We use KMeans clustered to determine 5 different types of neighbourhoods based on their 10 most common venues.

In [45]:
from sklearn.cluster import KMeans
kclusters = 5
neighbourhood_grouped_clustering = neighbourhood_grouped.drop('Postal Code',1)

In [46]:
kmeans = KMeans(n_clusters=kclusters).fit(neighbourhood_grouped_clustering)

In [47]:
# add clustering labels
neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)

neighbourhood_merged = locations

# merge toronto_grouped with toronto_data to add latitude/longitude for each neighborhood
neighbourhood_merged = neighbourhood_merged.join(neighborhoods_venues_sorted.set_index('Postal Code'), on='Postal Code')

neighbourhood_merged.dropna(inplace=True)
neighbourhood_merged.head() # check the last columns!

Unnamed: 0,Postal Code,Latitude,Longitude,Place,Municipality,Cluster Labels,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,1055,52.38017,4.85235,Amsterdam,Amsterdam,4.0,Bakery,Supermarket,Turkish Restaurant,Coffee Shop,Tram Station,Vegetarian / Vegan Restaurant,Bar,Playground,Indonesian Restaurant,Performing Arts Venue
1,1058,52.35789,4.85096,Amsterdam,Amsterdam,4.0,Restaurant,Café,Supermarket,Diner,Coffee Shop,Hotel,Bar,French Restaurant,Belgian Restaurant,Bus Stop
2,1062,52.35483,4.83808,Amsterdam,Amsterdam,2.0,Tram Station,Hotel,Indonesian Restaurant,Clothing Store,Movie Theater,Bus Station,Café,Park,Gym,Sandwich Place
3,1066,52.346,4.81634,Amsterdam,Amsterdam,0.0,Flower Shop,Tram Station,Gym / Fitness Center,Furniture / Home Store,Grocery Store,Bus Stop,Gym,Drugstore,Park,Supermarket
4,1068,52.35923,4.80521,Amsterdam,Amsterdam,0.0,Bakery,Tram Station,Turkish Restaurant,Shopping Mall,Chinese Restaurant,Supermarket,Snack Place,Clothing Store,Drugstore,Lingerie Store


In [48]:
import matplotlib.cm as cm
import matplotlib.colors as colors

Let's Create another map so we can see what the distribution of our clusters look like.

In [49]:
#create new map where the locations are clustered
map_netherlands_clustered = folium.Map(location=[52.40451, 4.89127], zoom_start=10)

x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]


# add markers to map
for lat, lng, city, postal_code, cluster in zip(neighbourhood_merged['Latitude'], neighbourhood_merged['Longitude'], neighbourhood_merged['Place'], neighbourhood_merged['Postal Code'], neighbourhood_merged['Cluster Labels']):
    label = '{}, {}'.format(postal_code, city)
    label = folium.Popup(label, parse_html=True)
    folium.Circle(
        [lat, lng],
        radius=500,
        popup=label,
        color=rainbow[int(cluster-1)],
        fill=True,
        fill_color=rainbow[int(cluster-1)],
        fill_opacity=0.7,
        parse_html=False).add_to(map_netherlands_clustered)  
    
map_netherlands_clustered

Now add the locations of our gyms and we might get some insights

In [50]:
# add all the gyms to the clustered map
for lat, lng, postalcode, city in zip(neighbourhood_gyms['Venue Latitude'], neighbourhood_gyms['Venue Longitude'], neighbourhood_gyms['Postal Code'], neighbourhood_gyms.index):
        label = '{},{}'.format(postalcode, city)
        label = folium.Popup(label, parse_html=True)
        folium.CircleMarker(
        [lat, lng],
        popup=label,
        color='black',
        radius=3,
        fill=True,
        fill_opacity=0.7,
        parse_html=False).add_to(map_netherlands_clustered)  
    
map_netherlands_clustered

In order to find the prime locations, we want to know which locations have fewer gyms than their cluster average.

In [51]:
neighbourhoods = neighbourhood_merged[['Postal Code', 'Cluster Labels', 'Place']] # create a new dataframe 

In [52]:
neighbourhood_gym_cluster = neighbourhoods.join(neighbourhood_gyms.groupby('Postal Code')['Venue'].count(), on = 'Postal Code').sort_values('Venue', ascending = False) # count total amount of gyms per location

In [53]:
top_neighbourhoods = neighbourhood_gym_cluster.groupby('Cluster Labels')['Venue'].mean().sort_values(ascending=False).reset_index() # calculate average amount of gyms per location for each cluster
top_neighbourhoods.rename(columns = {'Venue':'Venue Average'}, inplace=True)
top_neighbourhoods.set_index('Cluster Labels', inplace=True)


In [54]:
neighbourhood_gym_data = neighbourhood_gym_cluster.join(top_neighbourhoods, on = 'Cluster Labels')

We make a selection of all the locations where the amount of gyms is lower than their cluster average, while located in one of the top 3 cities with the highest pop/gym density

In [66]:
#some locations have no gym,so we need to fill na with zeroes. 
neighbourhood_gym_data.fillna(0)
top_locations = neighbourhood_gym_data[((neighbourhood_gym_data['Venue'] < neighbourhood_gym_data['Venue Average']) & (neighbourhood_gym_data['Place'].isin(top_cities['City'])))].sort_values('Place')
top_locations = top_locations.set_index('Postal Code').join( locations.set_index('Postal Code'), on='Postal Code', how = 'inner',lsuffix='_l_',rsuffix='_r_')
top_locations.reset_index(inplace=True)
top_locations.drop(columns = ['Place_l_','Latitude','Longitude','Place_r_'],inplace=True)
top_locations


Unnamed: 0,Postal Code,Cluster Labels,Venue,Venue Average,Municipality
0,5211,3.0,4.0,4.363636,'s-Hertogenbosch
1,5247,2.0,1.0,2.239216,'s-Hertogenbosch
2,5215,2.0,2.0,2.239216,'s-Hertogenbosch
3,5391,2.0,1.0,2.239216,'s-Hertogenbosch
4,5242,0.0,1.0,2.704545,'s-Hertogenbosch
5,5243,2.0,1.0,2.239216,'s-Hertogenbosch
6,5221,2.0,1.0,2.239216,'s-Hertogenbosch
7,5234,2.0,2.0,2.239216,'s-Hertogenbosch
8,5241,2.0,1.0,2.239216,'s-Hertogenbosch
9,5233,2.0,1.0,2.239216,'s-Hertogenbosch


Create another map to visualize the prime locations

In [56]:
# create map of New York using latitude and longitude values
map_gyms = folium.Map(location=[52.40451, 4.89127], zoom_start=10)

# add markers to map
for lat, lng, postalcode, municipality in zip(top_locations['Latitude'], top_locations['Longitude'], top_locations['Postal Code'], top_locations['Municipality']):
        label = '{},{}'.format(postalcode, municipality)
        label = folium.Popup(label, parse_html=True)
        folium.Circle(
        [lat, lng],
        popup=label,
        radius=500,
        fill=True,
        fill_opacity=0.7,
        parse_html=False).add_to(map_gyms)  
    
map_gyms

In [69]:
top_locations # list of all the locations

Unnamed: 0,Postal Code,Cluster Labels,Venue,Venue Average,Municipality
0,5211,3.0,4.0,4.363636,'s-Hertogenbosch
1,5247,2.0,1.0,2.239216,'s-Hertogenbosch
2,5215,2.0,2.0,2.239216,'s-Hertogenbosch
3,5391,2.0,1.0,2.239216,'s-Hertogenbosch
4,5242,0.0,1.0,2.704545,'s-Hertogenbosch
5,5243,2.0,1.0,2.239216,'s-Hertogenbosch
6,5221,2.0,1.0,2.239216,'s-Hertogenbosch
7,5234,2.0,2.0,2.239216,'s-Hertogenbosch
8,5241,2.0,1.0,2.239216,'s-Hertogenbosch
9,5233,2.0,1.0,2.239216,'s-Hertogenbosch


Here we have a list of 32 locations that have fewer gyms than the average comparable location, and are situated in the 3 cities with the fewest gyms. This should be a good starting point for determining where to start a new Gym.

In [68]:
df[['City','Population','Total Gyms','gym density']].sort_values('gym density')

Unnamed: 0,City,Population,Total Gyms,gym density
1,Haarlem,162864,65,2505.6
3,Utrecht,357179,134,2665.514925
9,Leiden,125434,44,2850.772727
0,Amsterdam,872680,294,2968.29932
8,Zwolle,128617,38,3384.657895
2,'s-Gravenhage,544766,158,3447.886076
5,Zaanstad,156703,44,3561.431818
5,Groningen,232826,62,3755.258065
7,Almere,211514,55,3845.709091
6,Tilburg,219632,57,3853.192982
