# Applied Data Science Capstone: A Peek into Cologne's Bars and Restaurants Scene

For the last couple of months I've been taking a series of courses in Data Analysis, one of which is the IBM Data Science Professional Certificate on Coursera. The final task of the Capstone Project is to apply Data Science to a business problem, using geolocation data from the Foursquare API in combination with the methods and algorithms learned throughout the course. I've chosen to analyze data from Cologne, Germany, the city I've been living in for the past 5 years.

### 1. Introduction and Business Problem Statement

Cologne, founded in 38 BC, has a population of more than 1 million people, making it the fourth largest city in Germany. The city is part of the state of North Rhine-Westphalia and, together with cities such as Düsseldorf, Essen and Bonn, forms the Rhine-Ruhr area, known as the largest metropolitan region in Germany. It lies in the far west of the country, with the borders of Belgium and the Netherlands within a 100 km range, which places it centrally in Western Europe.

Home of the Kölner Dom, the tallest twin-spired church in the world, Cologne is known for its openness and multicultural vibe, with a vibrant nightlife and plenty of options for going out, including bars, clubs and restaurants with cuisine from all over the world. It hosts some of the biggest events in the region and is a regular stop for major concerts and tours.

This mix of history, entertainment and privileged location makes Cologne a very good destination for tourists and a nice place to live. In this context, investing in the city's entertainment business is often considered a good option, but it raises a few questions: What kind of establishment makes the most sense to open? Which locations would be best to start such a business? How competitive is the chosen market? These are some of the questions this analysis sets out to answer.

#### 1.1 The Problem Statement: Which Cologne neighborhood is the best to start a bar or restaurant?

A client is considering opening an establishment in the city of Cologne and asked for the available data to be analyzed and cross-referenced against a set of criteria to support the decision. Further details below:

- Two types of establishment were defined as initial targets: a bar or a restaurant. It must be one of these two categories, so the analysis will prioritize them;
- The new establishment must be located where there is a good number of people, preferably not far from the city center. As an objective target, a minimum population density of 3000 people per square kilometer was defined;
- On the other hand, it is important that the establishment has the opportunity to stand out, so a neighborhood with too much competition in the same category should be avoided;
- The place will be rented and the client is willing to pay up to 12€ per square meter.

#### 1.2 The Target Audience

Besides the obvious interest of the aforementioned client in making a data-driven decision about where to invest his money, this study can be helpful to a few other stakeholders, such as:

- Other investors interested in the entertainment and gastronomy business in the city of Cologne;
- Fellow Data Scientists looking for ideas or references on how to carry out tasks that might be applicable to their own projects, not just for Cologne, but anywhere in the world;
- Cologne citizens in general, to get further insights into the city they live in and into which neighborhoods offer more venues to go and check out.

In [8]:
# Install third-party packages that may be missing
# (%pip does not accept conda's "-c anaconda" or "--yes" options)
%pip install lxml beautifulsoup4 folium

# Import all the required libraries and modules
import requests # library to handle HTTP requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation

from bs4 import BeautifulSoup # HTML parser used for web scraping

# libraries for displaying images
from IPython.display import Image, HTML

# library for chart plotting
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler # Module from Scikit-learn for data normalization
from sklearn.cluster import KMeans # Module from Scikit-learn for clustering

import folium # library for interactive maps

# transform JSON data into a pandas dataframe
from pandas import json_normalize

print('Folium installed')
print('Libraries imported.')

Note: you may need to restart the kernel to use updated packages.
Folium installed
Libraries imported.


### 2. Data Gathering and Sorting Criteria

#### 2.1 Description

To perform the analysis, geolocation data requested from Foursquare will be combined with external sources found on the internet. In general, the data to be collected is as follows:

- Venue positions and insights: Foursquare API;
- Details on Cologne districts: web scraping of public web pages.

Dataframes will be generated and cross-referenced in search of key information under the given criteria, which are:

- Rent costs: no more than 12€ per square meter;
- Demographics: minimum population density of 3000 people per square kilometer;
- Competition: the chosen district can't have more than 20 establishments of the same category as the one to be opened;
- Location: as close to the city center as possible.
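As a rough illustration of how these criteria combine, the sketch below applies the three numeric thresholds to a small list of records. The district names and figures are invented placeholders (not the Cologne data) and the field names `density`, `rent` and `competitors` are my own; plain Python is used so that it runs without the notebook's dependencies. The location criterion is not a hard threshold, so it is left out here.

```python
# Thresholds taken from the criteria listed above
MAX_RENT = 12.0          # euros per square meter
MIN_DENSITY = 3000       # people per square kilometer
MAX_COMPETITORS = 20     # establishments of the same category

# Hypothetical records: names and numbers are placeholders, not real data
candidates = [
    {"district": "District A", "density": 4500, "rent": 11.0, "competitors": 12},
    {"district": "District B", "density": 2500, "rent": 9.5, "competitors": 5},
    {"district": "District C", "density": 6000, "rent": 13.2, "competitors": 8},
    {"district": "District D", "density": 3200, "rent": 10.4, "competitors": 35},
]

def meets_criteria(record):
    """Return True when a record satisfies all three numeric criteria."""
    return (record["density"] >= MIN_DENSITY
            and record["rent"] <= MAX_RENT
            and record["competitors"] <= MAX_COMPETITORS)

# Only records passing every threshold end up on the shortlist
shortlist = [r["district"] for r in candidates if meets_criteria(r)]
print(shortlist)
```

In the notebook itself the same idea is applied one criterion at a time, which makes it easier to see how many districts each filter removes.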


#### 2.2 Considerations on data quality and source choices

Regarding data quality and reliability, the level of confidence in the accuracy of the datasets is assumed to be fairly high, given the sources used and the broad availability of alternative data on the internet that corroborates the information used. Although using the Foursquare API as the main source of venue geolocation data is one of the Capstone course requirements, the API is widely known for supplying data to popular and trusted applications, such as Apple’s Maps and Foursquare City Guide itself.

For the city data, the accuracy of the neighborhood names, their distribution, population density, latitude and longitude was assessed by cross-checking several sources and by using Google Maps to validate the geolocation. The latter was necessary because a geocoder was used to find the coordinates and, since it essentially runs a keyword search on the district names in the dataframe, mismatches could occur at first.

One key concern, though, was the available rent price data, for a few reasons. First, there are substantial differences between sources (much of the available information is based on real estate listings and websites, which might be biased). Second, quite a few sources could not be used because they had no breakdown per district, which makes a comparison unfeasible. Finally, data that was too old (such as reports that are 10 or more years old) could not be considered either. Given that, the rent data source was chosen based on how reputable the website to be scraped was and on the year of the report.
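Part of that validation could be automated by checking that each geocoded point lies within a plausible distance of the city center. The sketch below uses the haversine formula; the 15 km tolerance and the function names are assumptions made for illustration, while the city-center coordinates are the ones used for the maps later in this notebook.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius
CITY_CENTER = (50.936631, 6.958401)  # Cologne, as used for the maps in this notebook

def haversine_km(point_a, point_b):
    """Great-circle distance in kilometers between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(radians, point_a)
    lat2, lon2 = map(radians, point_b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def is_plausible(point, max_km=15.0):
    """Flag geocoding results that land too far from the city center (assumed tolerance)."""
    return haversine_km(point, CITY_CENTER) <= max_km

print(is_plausible((50.95, 6.95)))   # a point a few kilometers from the center
print(is_plausible((48.14, 11.58)))  # a point hundreds of kilometers away
```

A check like this does not replace looking at the map, but it catches the obvious case where a district name is matched to a place with the same name in another region.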

### 3. Methodology and Exploratory Data Analysis

#### 3.1 Web scraping Cologne Districts Data and Cleaning

First, I collected the list of Cologne's districts from Wikipedia, then sorted, formatted and renamed the data as needed:

In [None]:
# Read the districts table from Wikipedia (second table on the page)
cgn_df = pd.read_html('https://en.wikipedia.org/wiki/Districts_of_Cologne')[1]
cgn_df = cgn_df.drop([9,10]) # rows 9 and 10 do not describe a district
cgn_df = cgn_df.drop(['Map','Coat','City parts','District Councils','Town Hall'], axis=1)
cgn_df = cgn_df.rename(columns={'City district':'District','Population1':'Population','Pop. density':'Pop_Density'})
# Strip the "District N " prefix, e.g. 'District 1 Köln-Innenstadt' -> 'Köln-Innenstadt'
cgn_df['District'] = cgn_df['District'].str.replace(r'^District \d+ ', '', regex=True)
# Convert the density to a number before sorting, otherwise the sort is alphabetical
cgn_df['Pop_Density'] = cgn_df['Pop_Density'].str.replace('/km²','', regex=False).str.replace('.','', regex=False).astype(float)
cgn_df = cgn_df.sort_values('Pop_Density', ascending=False)
cgn_df

In [None]:
labels = cgn_df['District'].tolist() # tick labels in the same order as the sorted dataframe

ind = np.arange(len(labels))
width = 0.3

fig, ax = plt.subplots(figsize=(16,8))
rects = ax.bar(ind, cgn_df['Pop_Density'], width, color='#5bc0de')
ax.set_title("Cologne Neighborhoods - Population Density Bar Plot", fontsize=16)
ax.set_xticks(ind)
ax.set_xticklabels(labels)
ax.set_ylabel('Population Density (Pop./km²)')
ax.set_xlabel('Districts')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)


def autolabel(rects, xpos='center'):
    """Write the height of each bar above it; xpos is 'center', 'left' or 'right'."""
    ha = {'center': 'center', 'right': 'left', 'left': 'right'}
    offset = {'center': 0, 'right': 1, 'left': -1}

    for rect in rects:
        height = round(rect.get_height(), 2)
        # rect.axes is the Axes the bar belongs to, so no global "ax" is needed
        rect.axes.annotate('{}'.format(height),
                           xy=(rect.get_x() + rect.get_width() / 2, height),
                           xytext=(offset[xpos]*3, 3),  # use 3 points offset
                           textcoords="offset points",  # in both directions
                           ha=ha[xpos], va='bottom', fontsize=14)

autolabel(rects, "center")

fig.tight_layout()

plt.show()

As the cleaned dataframe shows, only 4 districts have more than 3000 people per square kilometer, so all rows that do not meet this criterion are dropped:

In [None]:
# Keep only the districts that reach the minimum population density
cgn_df = cgn_df[cgn_df['Pop_Density'] >= 3000]
cgn_df

#### 3.2 Getting Districts' Coordinates

The next step was to find and append the coordinates of each district. For that, I used the Nominatim geocoder from the geopy library:

In [9]:
%pip install geopy
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values



In [None]:
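from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="Cologne Explorer")
# Nominatim's usage policy asks for at most one request per second
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)
geo_results = cgn_df['District'].apply(geocode)
# geocode returns None when nothing is found, so fall back to NaN instead of raising an error
cgn_df['Latitude'] = geo_results.apply(lambda loc: loc.latitude if loc is not None else np.nan)
cgn_df['Longitude'] = geo_results.apply(lambda loc: loc.longitude if loc is not None else np.nan)
cgn_df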

Please note that, with the population density data already scraped, it was possible to rule out the districts that did not meet the criterion. As a result, out of the 9 districts that make up the city of Cologne, the following remained:

- Köln-Innenstadt (also often referred to as the city center);
- Köln-Ehrenfeld;
- Köln-Nippes;
- Köln-Lindenthal.

#### 3.3 Getting Rent prices data and cleaning

Then, I gathered the rent prices per square meter for every district from official reports published on www.koeln.de:

In [None]:
rent_df = pd.read_html('https://www.koeln.de/immobilien/mietspiegel.html')[0]
rent_df.drop(['3. Quartal 2018','Veränderung'], axis=1, inplace=True)
rent_df.rename(columns={'Stadtteil':'District','4. Quartal 2018/04':'Price_m2_2018'}, inplace=True)
rent_df.head()

After that, I formatted and sorted the resulting dataframe accordingly:

In [None]:
neigh_list = ['Innenstadt','Ehrenfeld','Nippes','Lindenthal']
# Keep only the districts that passed the density filter; .copy() avoids modifying a view of rent_df
rent_df2 = rent_df.loc[rent_df['District'].isin(neigh_list), ['District','Price_m2_2018']].copy()
# Convert prices such as '11,50 €' to floats
rent_df2['Price_m2_2018'] = rent_df2['Price_m2_2018'].str.replace('€','', regex=False).str.replace(',','.', regex=False).astype(float)
rent_df2 = rent_df2.sort_values('Price_m2_2018', ascending=False)
rent_df2

In [None]:
# The term "Innenstadt" does not appear in rent_df, so the district was missing from rent_df2. Since the Innenstadt district is mainly composed of Altstadt-Süd and Altstadt-Nord, the average of both is appended to the new dataframe
altstadt_prices = rent_df.iloc[0:2]['Price_m2_2018'].str.replace('€','', regex=False).str.replace(',','.', regex=False).astype(float)
innenstadt = pd.DataFrame({'District':['Innenstadt'],'Price_m2_2018':[round(altstadt_prices.mean(), 1)]})
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
rent_df2 = pd.concat([rent_df2, innenstadt], ignore_index=True)
rent_df2 = rent_df2.sort_values('Price_m2_2018', ascending=False)
rent_df2['District'] = rent_df2['District'].replace({'Innenstadt':'Köln-Innenstadt','Rodenkirchen':'Köln-Rodenkirchen','Lindenthal':'Köln-Lindenthal','Ehrenfeld':'Köln-Ehrenfeld','Nippes':'Köln-Nippes','Chorweiler':'Köln-Chorweiler','Porz':'Köln-Porz','Kalk':'Köln-Kalk','Mülheim':'Köln-Mülheim'})
rent_df2
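The string cleaning used for the prices can be isolated into a small helper. The sketch below is plain Python rather than pandas; the helper name and the sample strings are made up for illustration and are not values from the rent table. Unlike the cells above, it also removes thousands separators, which is an extra assumption about the input format.

```python
def parse_euro(text):
    """Convert a German-formatted price such as '11,50 €' to a float."""
    cleaned = text.replace('€', '').strip()
    # drop thousands dots first, then turn the decimal comma into a point
    cleaned = cleaned.replace('.', '').replace(',', '.')
    return float(cleaned)

# Placeholder strings, not taken from the rent table
sample_prices = ['11,50 €', '12,30 €']
values = [parse_euro(p) for p in sample_prices]

# Average of the two prices, rounded to one decimal as done for Innenstadt
average = round(sum(values) / len(values), 1)
print(values, average)
```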

In [None]:
labels = rent_df2['District'].tolist() # tick labels in the same order as the sorted dataframe

ind = np.arange(len(labels))
width = 0.3

fig, ax = plt.subplots(figsize=(16,8))
rects = ax.bar(ind, rent_df2['Price_m2_2018'], width, color='#5cb85c')
ax.set_title("Cologne Neighborhoods - Rental Price 2018", fontsize=16)
ax.set_xticks(ind)
ax.set_xticklabels(labels)
ax.set_ylabel('Rental Price (€/m²)')
ax.set_xlabel('Districts')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

autolabel(rects, "center")

fig.tight_layout()

plt.show()

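According to the resulting dataframe, two districts already had a rent price of more than 12€ per square meter. We will hence drop the respective rows from the main dataframe:

In [None]:
# Keep only the districts whose rent is within the client's budget of 12€ per square meter
affordable = rent_df2.loc[rent_df2['Price_m2_2018'] <= 12, 'District']
cgn_df = cgn_df[cgn_df['District'].isin(affordable)]
cgn_df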

#### 3.4 Getting Venues Data through Foursquare API

As mentioned before, this Capstone requires the venue data to be requested through the Foursquare API. The process is straightforward, and it was made easier by the previous filtering of the city data.

Here we need to look at both bars and restaurants, so I created two query variables, query_1 for bars and query_2 for restaurants, and four URLs: two per district, one for each type of venue. The radius was established through a few iterations during the data clustering. The method relied on prior knowledge of the city while checking on the map which distribution of the data made the most sense: a bigger radius made it unclear which neighborhood a venue belonged to, so I tried different values until the district boundaries were respected:
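The request URLs can also be assembled with `urllib.parse.urlencode`, which takes care of escaping the parameter values. This is only a sketch: the helper name is mine, the credentials are dummy placeholders and the coordinates are the city-center ones used for the maps in this notebook, while the parameter names mirror the query string used in the cells that follow.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = 'https://api.foursquare.com/v2/venues/search'

def build_search_url(lat, lng, query, radius, limit, client_id, client_secret, version):
    """Assemble a venue search URL from its individual parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),  # latitude,longitude of the search center
        'v': version,
        'query': query,
        'radius': radius,                # search radius in meters
        'limit': limit,
    }
    return SEARCH_ENDPOINT + '?' + urlencode(params)

# Dummy credentials: replace with real ones before sending a request
url = build_search_url(50.936631, 6.958401, 'Bar', 1250, 200,
                       client_id='DUMMY_ID', client_secret='DUMMY_SECRET',
                       version='20200429')
print(url)
```

Note that `urlencode` percent-encodes the comma in `ll`, which is a standard encoding that servers decode back to the original value.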

In [None]:
import os

# Credentials are read from environment variables so they are not stored in the notebook
# (the variable names below are my own choice)
CLIENT_ID = os.environ['FOURSQUARE_CLIENT_ID'] # your Foursquare ID
CLIENT_SECRET = os.environ['FOURSQUARE_CLIENT_SECRET'] # your Foursquare Secret
VERSION = '20200429'
LIMIT = 200 # requested maximum; to my knowledge the v2 search endpoint caps results at 50 per call
print('Credentials loaded.')

In [None]:
query_1 = 'Bar'
query_2 = 'Restaurant'
radius = 1250
print('Queries OK!')

In [None]:
def build_url(district_row, query):
    """Build the Foursquare search URL centered on the coordinates of one district."""
    return 'https://api.foursquare.com/v2/venues/search?client_id={}&client_secret={}&ll={},{}&v={}&query={}&radius={}&limit={}'.format(
        CLIENT_ID, CLIENT_SECRET, district_row['Latitude'], district_row['Longitude'], VERSION, query, radius, LIMIT)

# Row 0 is Köln-Innenstadt and row 1 is Köln-Nippes after the filtering above
url_inne1 = build_url(cgn_df.iloc[0], query_1)
url_nipp1 = build_url(cgn_df.iloc[1], query_1)

url_inne2 = build_url(cgn_df.iloc[0], query_2)
url_nipp2 = build_url(cgn_df.iloc[1], query_2)

print("URLs created!")

In [None]:
results_inne1 = requests.get(url_inne1).json()
results_nipp1 = requests.get(url_nipp1).json()

results_inne2 = requests.get(url_inne2).json()
results_nipp2 = requests.get(url_nipp2).json()

print("GET Request done, results generated")

In [None]:
venues_inne1 = results_inne1['response']['venues']
venues_nipp1 = results_nipp1['response']['venues']

venues_inne2 = results_inne2['response']['venues']
venues_nipp2 = results_nipp2['response']['venues']

print("JSON File sorted out")

With those variables I was able to create the dataframes; below is one example of the code used:

In [None]:
df_inne1 = pd.json_normalize(venues_inne1)
df_inne1['District'] = 'Köln-Innenstadt'
df_inne1['Category'] = 'Bar'
df_inne1.head()

In [None]:
df_nipp1 = json_normalize(venues_nipp1)
df_nipp1['District'] = 'Köln-Nippes'
df_nipp1['Category'] = 'Bar'

In [None]:
df_inne2 = json_normalize(venues_inne2)
df_inne2['District'] = 'Köln-Innenstadt'
df_inne2['Category'] = 'Restaurant'

In [None]:
df_nipp2 = json_normalize(venues_nipp2)
df_nipp2['District'] = 'Köln-Nippes'
df_nipp2['Category'] = 'Restaurant'

#### 3.5 Cleaning the Venues data

After I called the data out of Foursquare, I still needed to clean and prepare the data. The two main tasks were to drop everything that would not be relevant and to blend the different data-frames created, as they were at this point separated.

In [None]:
df_merged = pd.concat([df_inne1, df_inne2, df_nipp1, df_nipp2], ignore_index=True)
# Keep only the relevant columns; rename returns a new dataframe, which avoids a SettingWithCopyWarning
df_clean = df_merged[['name','Category','District','location.address','location.postalCode','location.lat','location.lng','location.distance']].rename(columns={'name':'Venue','location.address':'Address','location.postalCode':'PostalCode','location.lat':'LocationLatitude','location.lng':'LocationLongitude','location.distance':'LocationDistance'})
df_clean.head()

Below is the distribution of the venue locations on the city map, with Köln-Nippes in red and Köln-Innenstadt in blue:

In [None]:
# create a map centered on Cologne, Germany
cologne_map = folium.Map(location=[50.936631, 6.958401], zoom_start=10)

# feature group that holds one marker per venue
locations = folium.FeatureGroup()

district_colors = {'Köln-Innenstadt': 'blue', 'Köln-Nippes': 'red'}

# loop through the venues and add each one to the feature group, colored by district
for lat, lng, dist in zip(df_clean.LocationLatitude, df_clean.LocationLongitude, df_clean.District):
    color = district_colors.get(dist)
    if color is None:
        continue # skip venues from any other district
    locations.add_child(
        folium.CircleMarker(
            [lat, lng],
            radius=5, # size of the circle marker in pixels
            color=color,
            fill=True,
            fill_color=color,
            fill_opacity=0.6))

cologne_map.add_child(locations)

In [None]:
nipp_df = pd.concat([df_nipp1, df_nipp2], ignore_index=True)
# select the relevant columns and rename them in one step (returns a new dataframe)
nipp_df = nipp_df[['name','Category','District','location.address','location.postalCode','location.lat','location.lng','location.distance']].rename(columns={'name':'Venue','location.address':'Address','location.postalCode':'PostalCode','location.lat':'LocationLatitude','location.lng':'LocationLongitude','location.distance':'LocationDistance'})

nipp_df.head()

In [None]:
nipp_df['Category'].value_counts()

In [None]:
inne_df = pd.concat([df_inne1, df_inne2], ignore_index=True)
# select the relevant columns and rename them in one step (returns a new dataframe)
inne_df = inne_df[['name','Category','District','location.address','location.postalCode','location.lat','location.lng','location.distance']]
inne_df = inne_df.rename(columns={'name':'Venue',
                                  'location.address':'Address',
                                  'location.postalCode':'PostalCode',
                                  'location.lat':'LocationLatitude',
                                  'location.lng':'LocationLongitude',
                                  'location.distance':'LocationDistance'})

inne_df.head()

In [None]:
inne_df['Category'].value_counts()

One thing that stands out is the clear difference in the number of venues between the two neighborhoods, which is in fact related to the next topic to be investigated.

#### 3.6 Competition benchmark

One key criterion for choosing the kind of venue and the neighborhood was competition. Let us look at the breakdown of the numbers per neighborhood:

In [None]:
# Number of venues per category, in the order of the index below (Köln-Innenstadt, Köln-Nippes)
neigh_data = {'Bar': [50,17],
              'Restaurant': [50,6]}

comp_df = pd.DataFrame(neigh_data, columns=['Bar','Restaurant'], index=['Köln-Innenstadt','Köln-Nippes'])
comp_df
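Counts like these do not have to be typed in by hand. Below is a minimal sketch of the tally, using `collections.Counter` on (district, category) pairs; the venue list is a made-up placeholder and not the Foursquare result.

```python
from collections import Counter

# Placeholder venues as (district, category) pairs; not real Foursquare data
venues = [
    ('District A', 'Bar'), ('District A', 'Bar'), ('District A', 'Restaurant'),
    ('District B', 'Bar'), ('District B', 'Restaurant'), ('District B', 'Restaurant'),
]

# Count how often each (district, category) pair occurs
counts = Counter(venues)

# Arrange the tally as {district: {category: count}}
breakdown = {}
for (district, category), n in counts.items():
    breakdown.setdefault(district, {})[category] = n

print(breakdown)
```

With the dataframes of this notebook, something like `pd.crosstab(df_clean['District'], df_clean['Category'])` should give the same kind of table directly.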

In [None]:
labels = ['Köln-Innenstadt','Köln-Nippes']
bar = comp_df['Bar']
restaurant = comp_df['Restaurant']

ind = np.arange(len(bar))
width = 0.2

fig, ax = plt.subplots(figsize=(10,8))
# place the two bars symmetrically around each tick so the district label sits between them
rects1 = ax.bar(ind - width/2, bar, width, label='Bar', color='#5cb85c')
rects2 = ax.bar(ind + width/2, restaurant, width, label='Restaurant', color='#5bc0de')

ax.set_title("Neighborhoods Benchmark: Venues Breakdown", fontsize=16)
ax.set_xticks(ind)
ax.set_xticklabels(labels)
# the y-axis is hidden because every bar is annotated with its number of venues
ax.get_yaxis().set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.legend(fontsize=14)

autolabel(rects1, "center")
autolabel(rects2, "center")

fig.tight_layout()

plt.show()

### 4. Results

According to the competition target defined at the beginning of the analysis, there should be no more than 20 venues of the same category in the chosen neighborhood. Köln-Innenstadt exceeds that limit for both bars and restaurants (50 each), whereas Köln-Nippes stays below it in both categories, so of the two districts left Köln-Nippes has more appeal in terms of competition.
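The comparison against the competition target can be written down explicitly. The sketch below reuses the counts from the competition benchmark and the limit of 20; the variable names are mine.

```python
MAX_COMPETITORS = 20  # competition target from the criteria

# Venue counts from the competition benchmark
venue_counts = {
    'Köln-Innenstadt': {'Bar': 50, 'Restaurant': 50},
    'Köln-Nippes': {'Bar': 17, 'Restaurant': 6},
}

# Keep every (district, category) pair that stays within the target
within_target = [
    (district, category, n)
    for district, categories in venue_counts.items()
    for category, n in categories.items()
    if n <= MAX_COMPETITORS
]
print(within_target)
```

Changing `MAX_COMPETITORS` is a quick way to see how sensitive the shortlist is to the chosen target.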

There is still one decision to make: what makes more sense to open, a bar or a restaurant? According to the Foursquare data, there are 17 bars and 6 restaurants in Nippes. To put this in perspective, a couple of considerations are due. First, bars tend to offer a fairly uniform kind of entertainment, unless the bar is themed, part of a franchise or already well known and marketed. Restaurants, on the other hand, might not compete as strongly with each other because of the wide variety of cuisines available, especially in a town as cosmopolitan as Cologne. There may be exceptions depending on factors that are not covered by this study, such as opening hours, pricing and online reviews. Under these assumptions, **the recommendation proposed by the study is to open a restaurant in the neighborhood Köln-Nippes.**

### 5. Discussion

A couple of further considerations need to be mentioned. The first is that, although the data shows a clear path with a well-defined rationale towards the conclusion above, several factors were not part of the criteria used in the analysis. This means that the business case could be somewhat different depending on which other targets were established. For example, one key factor in leaning towards Nippes instead of Innenstadt was the competition target. What would happen if we also considered a benchmark of online reviews and more detailed rent data for every neighborhood? What if the competition target were substantially higher than 20?

Another aspect is the distance to the city center, which for our use case did not need to be examined further, as the neighborhoods farther from downtown were ruled out early in the analysis. But what would happen if the average income were considered? Köln-Lindenthal is known for being a higher-end neighborhood, which means more expensive (this is consistent with the rent data scraped during the study, where this neighborhood was the most expensive). On the other hand, not being that close to the city center could be an asset, depending on a couple of intangible aspects that were also not part of this study. Below is a visual reference of the venues of 4 of the main neighborhoods on the map of Cologne:

In [25]:
# Use both coordinates for clustering (slicing with [:,1:] would drop the latitude column)
X = df_clean[['LocationLatitude','LocationLongitude']].values
X = np.nan_to_num(X)
cluster_dataset = StandardScaler().fit_transform(X)
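For reference, `StandardScaler` rescales each column to zero mean and unit variance, using the population standard deviation. Below is a minimal sketch of the same computation in plain Python, on made-up coordinate pairs (not venue data); the function name is mine.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Return the z-scores of a list of numbers, using the population standard deviation."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        return [0.0 for _ in column]  # a constant column carries no information
    return [(value - mean) / std for value in column]

# Placeholder (latitude, longitude) pairs, not real venue coordinates
points = [(50.90, 6.90), (50.94, 6.96), (50.98, 7.02)]

# Standardize each coordinate separately, then pair the results again
latitudes, longitudes = zip(*points)
scaled = list(zip(standardize(latitudes), standardize(longitudes)))
print(scaled)
```

Scaling matters for k-means because the algorithm relies on Euclidean distances, so a column with a larger spread would otherwise dominate the clustering.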

In [None]:
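num_clusters = 5

# random_state fixes the random initialization so the cluster labels are reproducible between runs
k_means = KMeans(init="k-means++", n_clusters=num_clusters, n_init=6, random_state=0)
k_means.fit(cluster_dataset)
labels = k_means.labels_

print(labels)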

In [None]:
df_clean["ClusterLabels"] = labels
df_clean.head(5)

In [None]:
# create a map centered on Cologne, Germany
cologne_map = folium.Map(location=[50.936631, 6.958401], zoom_start=10)

# feature group that holds one marker per venue
locations = folium.FeatureGroup()

# one color per cluster label (labels 0 to 8 are supported)
cluster_colors = ['blue', 'yellow', 'red', 'green', 'orange', 'pink', 'purple', 'brown', 'violet']

# loop through the venues and add each one to the feature group, colored by cluster
for lat, lng, label in zip(df_clean.LocationLatitude, df_clean.LocationLongitude, df_clean.ClusterLabels):
    if not 0 <= label < len(cluster_colors):
        continue # skip labels without an assigned color
    color = cluster_colors[label]
    locations.add_child(
        folium.CircleMarker(
            [lat, lng],
            radius=5, # size of the circle marker in pixels
            color=color,
            fill=True,
            fill_color=color,
            fill_opacity=0.6))

cologne_map.add_child(locations)

In [None]:
# sort the venues by cluster and round the numeric columns to 2 decimals for display
clust_df = df_clean.sort_values('ClusterLabels').round(2)
clust_df.head()

By considering more of these intangible aspects, not just Lindenthal (in green on the map) but also Ehrenfeld (in yellow) could be more attractive in certain contexts. The bottom line is that, although the study is entirely data-driven, it serves only as a good initial reference on the key aspects of similar business cases, and in no way claims to be a definitive reference for decision making when other kinds of criteria are considered. To avoid scope creep and keep the Capstone Project as concise as possible, the number of variables and criteria was kept to a minimum.

### 6. Conclusion
In this analysis, I proposed a discussion, based on objective targets, of which neighborhood of the city of Cologne would have more appeal for opening either a bar or a restaurant. After conducting this study, we can infer that there are substantial differences between parts of the city when it comes to starting a business in the entertainment and gastronomy industries. Using several data analysis techniques and well-defined criteria, we saw step by step how to apply Data Science principles to make an informed decision. As a resident of Cologne for a couple of years, I found that conducting this study not only meant a lot of learning, but also brought joy and satisfaction, both because many of the conclusions match my daily experience of the city and because of the opportunity to get to know the town I live in a bit better. If you are a fellow Data Science student going through the Capstone project and using this material as a reference for your own analysis, I hope you find it useful, that you see the value in going through the whole process as I did, and that you have as much fun as I had doing this work.