# Data Science Capstone Project

##### In this notebook, we will explore the music scene in the cities of Germany and find the best spot to hold a music festival.

We have been tasked with finding the best spot to hold a music festival in Germany. It must be near as many people as possible, but outside of a city. It must also be accessible: it cannot be some spot in the middle of the forest, so we are looking for a spot with a camping site in the vicinity. We believe that more people will come to our festival if there is a strong music scene in the nearby cities, and that the music scene in any given city can be measured by the number of music venues. Thus, we aim to find the camping site in Germany that is close to as many music venues as possible.

In [1]:
# Import all necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import requests
from geopy.geocoders import Nominatim 
import folium
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated since pandas 1.0
from sklearn.cluster import KMeans 
import matplotlib.cm as cm
import matplotlib.colors as colors

We will first build a dataframe of German cities from this <a href="https://en.wikipedia.org/wiki/List_of_cities_in_Germany_by_population">Wikipedia page</a>, which lists them by population.

In [2]:
url = "https://en.wikipedia.org/wiki/List_of_cities_in_Germany_by_population"

response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser") # Get Wikipedia page in HTML
neigh_table = soup.find("table") # Find the table we want

cities_df = pd.read_html(str(neigh_table))[0] # Encode table as pandas dataframe

# Clean the data
cities_df = cities_df.loc[:, ['City', 'State', '2015estimate']]
cities_df.columns = ['City', 'State', 'Population (estimate)']

# Obtain geographical coordinates for each city
# Initialise variables
latitude = []
longitude = []

# Create the geocoder once and reuse it for every city
geolocator = Nominatim(user_agent="germany_explorer")

for address in cities_df.loc[:, 'City']:
    location = geolocator.geocode(address)
    latitude.append(location.latitude)
    longitude.append(location.longitude)

cities_df['Latitude'] = latitude
cities_df['Longitude'] = longitude

cities_df

Unnamed: 0,City,State,Population (estimate),Latitude,Longitude
0,Berlin,Berlin,3520031,52.517037,13.388860
1,Hamburg,Hamburg,1787408,53.550341,10.000654
2,Munich (München),Bavaria,1450381,48.117407,11.557851
3,Cologne (Köln),North Rhine-Westphalia,1060582,50.938361,6.959974
4,Frankfurt am Main,Hesse,732688,50.110644,8.682092
...,...,...,...,...,...
74,Erlangen,Bavaria,108336,49.598119,11.003645
75,Moers,North Rhine-Westphalia,104529,51.451283,6.628430
76,Siegen,North Rhine-Westphalia,102355,50.874980,8.022723
77,Hildesheim,Lower Saxony,101667,52.152164,9.951305
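
The geocoding loop above assumes that every lookup succeeds. geopy's `geocode` returns `None` when it finds nothing, in which case `location.latitude` would raise an error. Below is a minimal sketch of a more defensive loop. The helper `geocode_cities` and the offline stand-in `fake_geocode` are hypothetical names introduced for illustration; in the notebook we would pass `geolocator.geocode` instead.

```python
from collections import namedtuple

# Stand-in for geopy's Location object; only the two attributes used here.
Location = namedtuple("Location", ["latitude", "longitude"])

def geocode_cities(names, geocode):
    """Return (latitudes, longitudes), using None where a lookup fails."""
    latitudes, longitudes = [], []
    for name in names:
        location = geocode(name)
        if location is None:
            # Keep the lists aligned with the input instead of crashing.
            latitudes.append(None)
            longitudes.append(None)
        else:
            latitudes.append(location.latitude)
            longitudes.append(location.longitude)
    return latitudes, longitudes

# Hypothetical offline lookup so the sketch runs without network access.
_known = {"Berlin": Location(52.517037, 13.388860)}

def fake_geocode(name):
    return _known.get(name)

lats, lngs = geocode_cities(["Berlin", "Nowhere-City"], fake_geocode)
print(lats, lngs)
```

Rows with missing coordinates could then be dropped or looked up again by hand before mapping.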


Let us map the cities to check that everything is correct.

In [3]:
# Obtain geographical coordinates of Germany
address = 'Germany'

geolocator = Nominatim(user_agent="music_explorer")
location = geolocator.geocode(address)
germany_lat = location.latitude
germany_lng = location.longitude

# Create map of Germany using latitude and longitude values
test_map = folium.Map(location=[germany_lat, germany_lng], zoom_start=6)

# Add markers to map
for lat, lng, label in zip(cities_df['Latitude'], cities_df['Longitude'], cities_df['City']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(test_map)  
    
test_map

We now need to get the information on the music venues in each city. We will use the Foursquare API to fetch the data. First, we will test the process with Berlin.

```
# Foursquare credentials
CLIENT_ID = 'XXX'
CLIENT_SECRET = 'XXX'
ACCESS_TOKEN = 'XXX'
VERSION = '20180604'
LIMIT = 100
```
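
The search requests in this notebook are assembled with a long format string. As an alternative, here is a sketch that builds the same kind of URL from a dictionary, so that values such as a query containing spaces are escaped properly. The endpoint and parameter names are the ones used in this notebook; the helper name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode

def build_search_url(client_id, client_secret, access_token, version,
                     lat, lng, query, radius, limit):
    """Assemble a venues/search URL with escaped query parameters."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "oauth_token": access_token,
        "v": version,
        "ll": f"{lat},{lng}",   # latitude,longitude of the search centre
        "query": query,
        "radius": radius,
        "limit": limit,
    }
    return "https://api.foursquare.com/v2/venues/search?" + urlencode(params)

# Placeholder credentials, as in the cell above.
url = build_search_url("XXX", "XXX", "XXX", "20180604",
                       52.517037, 13.38886, "Live Music", 4000, 100)
print(url)
```

The resulting string can be passed to `requests.get` exactly like the hand-formatted URLs.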

In [5]:
search_query = 'Music'
radius = 4000

berlin_lat = cities_df[cities_df['City'] == 'Berlin'].loc[:, 'Latitude'].iloc[0]
berlin_long = cities_df[cities_df['City'] == 'Berlin'].loc[:, 'Longitude'].iloc[0]

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(
    CLIENT_ID, CLIENT_SECRET, berlin_lat, berlin_long, ACCESS_TOKEN, VERSION, search_query, radius, LIMIT)

results = requests.get(url).json()

# Assign relevant part of JSON to venues
venues = results['response']['venues']

# Transform venues into a dataframe
dataframe = json_normalize(venues)

# Keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name', 'categories'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
dataframe_filtered = dataframe.loc[:, filtered_columns]

# Function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

# Filter the category for each row
dataframe_filtered['categories'] = dataframe_filtered.apply(get_category_type, axis=1)

# Clean column names by keeping only last term
dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]

dataframe_filtered.head()

Unnamed: 0,name,categories,address,lat,lng,labeledLatLngs,distance,postalCode,cc,city,state,country,formattedAddress,neighborhood,crossStreet,id
0,Sony/ATV Music Publishing Germany,Office,Kemperplatz 1,52.50937,13.373688,"[{'label': 'display', 'lat': 52.50936956797032...",1335,10785,DE,Berlin,Berlin,Deutschland,"[Kemperplatz 1, 10785 Berlin, Deutschland]",,,4dcd042a22718eed7a136582
1,Berlin Music Commission (BMC),Music Venue,Brückenstr. 1,52.511071,13.416539,"[{'label': 'display', 'lat': 52.51107105502184...",1989,10179,DE,Berlin,Berlin,Deutschland,"[Brückenstr. 1, 10179 Berlin, Deutschland]",,,4da495b4b521224befd835ee
2,TIDAL / WiMP Music,Tech Startup,Rosenthaler Strasse 23,52.529638,13.401823,"[{'label': 'display', 'lat': 52.52963768551764...",1654,10119,DE,Berlin,Berlin,Deutschland,"[Rosenthaler Strasse 23, 10119 Berlin, Deutsch...",,,5465e1d7498e40bd82c54329
3,Chamber Music Hall (Kammermusiksaal),Concert Hall,Herbert-von-Karajan-Str. 1,52.509411,13.368934,"[{'label': 'display', 'lat': 52.50941070233693...",1594,10785,DE,Berlin,Berlin,Deutschland,"[Herbert-von-Karajan-Str. 1, 10785 Berlin, Deu...",,,4adcda8bf964a520e34921e3
4,KMG Kobalt Music Germany,Office,Oberwallstr. 1,52.518634,13.397021,"[{'label': 'display', 'lat': 52.518634, 'lng':...",580,10117,DE,Berlin,Berlin,Deutschland,"[Oberwallstr. 1, 10117 Berlin, Deutschland]",,,4b60275cf964a520a7d729e3


Here we have the Berlin venues returned by our 'Music' search. Let's see them on a map.

In [6]:
# Create map of Berlin using latitude and longitude values
berlin_map = folium.Map(location=[berlin_lat, berlin_long], zoom_start=12)

# Add markers to map
for lat, lng, label in zip(dataframe_filtered['lat'], dataframe_filtered['lng'], dataframe_filtered['name']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(berlin_map)  
    
berlin_map
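
Note that the Berlin results include entries such as offices and a tech startup, which match the query but are not places where concerts happen. Below is a small sketch of filtering by category. The allow-list `LIVE_MUSIC_CATEGORIES` is a hypothetical choice made for illustration; the right set depends on the categories that actually come back.

```python
# Rows shaped like the Berlin results shown earlier: (name, category).
results = [
    ("Sony/ATV Music Publishing Germany", "Office"),
    ("Berlin Music Commission (BMC)", "Music Venue"),
    ("TIDAL / WiMP Music", "Tech Startup"),
    ("Chamber Music Hall (Kammermusiksaal)", "Concert Hall"),
    ("KMG Kobalt Music Germany", "Office"),
]

# Hypothetical allow-list of categories that count as live-music venues.
LIVE_MUSIC_CATEGORIES = {"Music Venue", "Concert Hall"}

live_venues = [name for name, category in results
               if category in LIVE_MUSIC_CATEGORIES]
print(live_venues)
```

On the dataframe itself, the same filter could be written as `dataframe_filtered[dataframe_filtered['categories'].isin(LIVE_MUSIC_CATEGORIES)]`.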

Since everything seems to be working, we will create a function that gets the music venues for a given city or list of cities.

In [7]:
def getVenues(search_query, names, latitudes, longitudes, radius=4000):
    """Return the Foursquare search results for search_query within radius metres of each city,
    where latitudes and longitudes are the coordinates of the cities in the list names."""
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            lat, 
            lng, 
            ACCESS_TOKEN, 
            VERSION, 
            search_query, 
            radius, 
            LIMIT)
        
        
        results = requests.get(url).json()['response']['venues']
        
        # return only relevant information for each nearby venue
        temp = []
        for v in results:
            
            try:
                temp.append([name, lat, lng, v['name'], v['location']['lat'], v['location']['lng'], v['categories'][0]['name']])
            
            except (IndexError, KeyError):
                temp.append([name, lat, lng, v['name'], v['location']['lat'], v['location']['lng'], 'Unknown'])
                
        venues_list.append(temp)

    venues_df = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    venues_df.columns = ['City', 
                  'City Latitude', 
                  'City Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return venues_df
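
The function above falls back to `'Unknown'` when a venue has no category. The same logic can be sketched without exceptions by reading the fields defensively. The helper name `venue_row` and the sample dictionaries are hypothetical; the samples only mimic the fields that the function reads.

```python
def venue_row(city, city_lat, city_lng, venue):
    """Flatten one venue dict into the row layout used by getVenues."""
    categories = venue.get("categories") or []
    # Use the first category's name if there is one, otherwise 'Unknown'.
    category = categories[0].get("name", "Unknown") if categories else "Unknown"
    location = venue.get("location", {})
    return [city, city_lat, city_lng, venue.get("name"),
            location.get("lat"), location.get("lng"), category]

# Made-up venues: one with a category, one without.
sample = [
    {"name": "A", "location": {"lat": 1.0, "lng": 2.0},
     "categories": [{"name": "Concert Hall"}]},
    {"name": "B", "location": {"lat": 3.0, "lng": 4.0}, "categories": []},
]

rows = [venue_row("Berlin", 52.5, 13.4, v) for v in sample]
for row in rows:
    print(row)
```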

Now let us run the function with our parameters.

In [8]:
music_venues = getVenues('Music', cities_df['City'], cities_df['Latitude'], cities_df['Longitude'])

music_venues.head()

Unnamed: 0,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Berlin,52.517037,13.38886,Sony/ATV Music Publishing Germany,52.50937,13.373688,Office
1,Berlin,52.517037,13.38886,Berlin Music Commission (BMC),52.511071,13.416539,Music Venue
2,Berlin,52.517037,13.38886,TIDAL / WiMP Music,52.529638,13.401823,Tech Startup
3,Berlin,52.517037,13.38886,Chamber Music Hall (Kammermusiksaal),52.509411,13.368934,Concert Hall
4,Berlin,52.517037,13.38886,KMG Kobalt Music Germany,52.518634,13.397021,Office


We now want to find the perfect spot for our festival. We want it to have as many music venues as possible in the vicinity, but we want it to be outside of a city. First, we will use k-Means clustering in order to cluster venues by geographic location and find the centre of those clusters.

In [9]:
# Initialize the model
k_clusters = 5
k_means = KMeans(init="k-means++", n_clusters = k_clusters, n_init=12)

# Fit model using Venue Latitude and Longitude
k_means.fit(music_venues.loc[:, ['Venue Latitude', 'Venue Longitude']])

# Find the centres of the clusters and define their latitude and longitude
cluster_centers = k_means.cluster_centers_
cluster_centers_lat = cluster_centers[:, 0].tolist()
cluster_centers_lng = cluster_centers[:, 1].tolist()

# Insert column of cluster labels into our dataframe
music_venues.insert(0, 'Cluster Labels', k_means.labels_)
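
To make clear what the clustering does, here is a toy sketch of the two alternating steps of k-means (Lloyd's algorithm): assign each point to its nearest centre, then move each centre to the mean of its points. This is not scikit-learn's implementation, which additionally uses k-means++ initialisation and several restarts as configured above. The function name `kmeans`, the sample points and the starting centres are made up. Like the notebook, the sketch treats latitude and longitude as plain planar coordinates.

```python
def kmeans(points, centres, iterations=10):
    """Plain Lloyd's algorithm on 2-D points, starting from the given centres."""
    centres = list(centres)
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centre (squared Euclidean distance).
        labels = [
            min(range(len(centres)),
                key=lambda j: (p[0] - centres[j][0]) ** 2 + (p[1] - centres[j][1]) ** 2)
            for p in points
        ]
        # Update step: move each centre to the mean of its assigned points.
        for j in range(len(centres)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centre if a cluster ends up empty
                centres[j] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centres, labels

# Made-up (lat, lng) points forming two obvious groups.
points = [(50.0, 7.0), (50.2, 7.2), (50.1, 6.9), (53.0, 10.0), (53.2, 10.1)]
centres, labels = kmeans(points, centres=[(50.0, 7.0), (53.0, 10.0)])
print(centres, labels)
```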

We now have the cluster centres. Let us see what these look like on a map:

In [10]:
# Create map of Germany using latitude and longitude values
festival_spots_map = folium.Map(location=[germany_lat, germany_lng], zoom_start=6)

# Add markers to map
for lat, lng in zip(cluster_centers_lat, cluster_centers_lng):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(festival_spots_map)  
    
festival_spots_map

Let's map the clusters, colour-coded, together with their centres.

In [11]:
# Create map
map_clusters = folium.Map(location=[germany_lat, germany_lng], zoom_start=6)

# Set color scheme for the clusters
colors_array = cm.rainbow(np.linspace(0, 1, k_clusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# Add markers to the map
markers_colors = []
for lat, lon, poi, cluster in zip(music_venues['Venue Latitude'], music_venues['Venue Longitude'], music_venues['Venue'], music_venues['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
    
# Add cluster centres to map
for lat, lng in zip(cluster_centers_lat, cluster_centers_lng):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_clusters) 
       
map_clusters

We cannot tell from the map exactly how many venues are in each cluster. Let's see how many there are:

In [12]:
cluster_counts = music_venues.groupby('Cluster Labels').count().sort_values('City', ascending = False)

cluster_counts

Cluster Labels,City,City Latitude,City Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
1,109,109,109,109,109,109,109
3,97,97,97,97,97,97,97
2,81,81,81,81,81,81,81
0,71,71,71,71,71,71,71
4,52,52,52,52,52,52,52
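
The grouped count repeats the same number in every column, although only one count per cluster is needed. A minimal sketch of the same tally with `collections.Counter` follows; the label list is rebuilt from the counts in the table above purely for illustration.

```python
from collections import Counter

# Cluster labels rebuilt from the counts shown in the table above.
cluster_labels = [1] * 109 + [3] * 97 + [2] * 81 + [0] * 71 + [4] * 52

counts = Counter(cluster_labels)

# most_common() sorts clusters by size, largest first.
ranking = counts.most_common()
best_label, best_size = ranking[0]
print(ranking)
print("Largest cluster:", best_label, "with", best_size, "venues")
```

On the dataframe, `music_venues['Cluster Labels'].value_counts()` gives the same ranking in a single column.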


It seems like the cluster near Cologne contains the most venues, so we will hold our festival near its centre. Let's get the latitude and longitude:

In [13]:
# Find best cluster
best_cluster = cluster_counts.head(1).index[0]

# Find latitude and longitude of best cluster
festival_lat = cluster_centers_lat[best_cluster]
festival_lng = cluster_centers_lng[best_cluster]

print('The coordinates of the ideal spot for our festival are: ', festival_lat, festival_lng)

The coordinates of the ideal spot for our festival are:  51.19117107231339 6.992926397450889


We now need to find the nearest campsite to this spot. To do this, we will use Foursquare:

In [14]:
search_query = 'Camping'
radius = 10000

url = 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&oauth_token={}&v={}&query={}&radius={}&limit={}'.format(
    CLIENT_ID, CLIENT_SECRET, festival_lat, festival_lng, ACCESS_TOKEN, VERSION, search_query, radius, LIMIT)

results = requests.get(url).json()

# Assign relevant part of JSON to venues
spots = results['response']['venues']

# Transform venues into a dataframe
dataframe = json_normalize(spots)

# Keep only columns that include venue name, and anything that is associated with location
filtered_columns = ['name'] + [col for col in dataframe.columns if col.startswith('location.')] + ['id']
dataframe_filtered = dataframe.loc[:, filtered_columns]

# Clean column names by keeping only last term
dataframe_filtered.columns = [column.split('.')[-1] for column in dataframe_filtered.columns]

# Sort by distance
camping_df = dataframe_filtered.sort_values('distance').head(1)

camping_df

Unnamed: 0,name,lat,lng,labeledLatLngs,distance,cc,country,formattedAddress,address,postalCode,city,state,id
0,Campingplatz Unterbacher See/Nord,51.199446,6.886507,"[{'label': 'display', 'lat': 51.1994459568629,...",7480,DE,Deutschland,[Deutschland],,,,,51f8026d498e06235fd81dd3
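
As a sanity check on the `distance` column, we can compute the great-circle distance between the cluster centre and the campsite ourselves. Below is a sketch using the haversine formula. The function name `haversine_m` and the mean Earth radius of 6,371 km are assumptions of this sketch; the coordinates are the ones printed above. The result should come out at roughly 7.5 km, in the same range as the value Foursquare reports.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_m(lat1, lng1, lat2, lng2, earth_radius_m=6_371_000):
    """Great-circle distance in metres between two (lat, lng) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lng2 - lng1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(a))

festival = (51.19117107231339, 6.992926397450889)  # cluster centre printed above
campsite = (51.199446, 6.886507)                   # Campingplatz Unterbacher See/Nord

distance = haversine_m(*festival, *campsite)
print(round(distance), "metres")
```

The same function could be used to check how far the chosen spot lies from each city centre, which relates to the requirement that the festival take place outside of a city.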


It looks like we found our best option. Let us see it on the map:

In [15]:
perfect_spot_lat = camping_df.loc[:, 'lat'].iloc[0]
perfect_spot_lng = camping_df.loc[:, 'lng'].iloc[0]

# Create map using latitude and longitude values of our spot
perfect_spot_map = folium.Map(location=[perfect_spot_lat, perfect_spot_lng], zoom_start=14)

# Add markers to map
folium.CircleMarker([perfect_spot_lat, perfect_spot_lng],
        radius=5,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(perfect_spot_map)  
    
perfect_spot_map