# Different cities, similar neighborhoods?

### Introduction / Scope
**Berlin and Hamburg** are Germany's two largest cities, with roughly 3.6 million and 1.8 million inhabitants respectively, and they are quite different from each other. **Berlin** is filled with an eclectic mix of history, culture and gorgeous sights; it is a city that intrigues visitors yet embraces them with open arms. The city is bursting with internationality as a result of many new residents from all over the world.
Germany’s second city, Hanseatic **Hamburg**, is both typically German and unique in its own way. With a long maritime history, this North Sea port city feels distinct from anywhere else in the country and offers plenty to see and do.

Because both cities lie in the northern part of Germany, there is a lively exchange between their residents. **This analysis is intended to show which areas of one city resemble those of the other.** This could be helpful for different use cases (and therefore different stakeholders):
* *People moving from one city to the other* often would like to live in a very specific type of neighborhood. This comparison can help them filter for areas similar to (or even different from, if they are up for something new) what they are used to.
* *Companies expanding within one of the cities* might want to look for a similar type of neighborhood, as they are targeting a specific user group. The comparison can be used as a first indication.
* *Companies expanding from one city to the other* might also try to find a neighborhood to settle in first. They can use their experience from the original city and look for a fitting (i.e. similar) neighborhood in the second one.

                                                *Hamburg*
<a data-flickr-embed="true" href="https://www.flickr.com/photos/danczw/43628571765/in/album-72157649772585816/" title="Glossy Jungfernstieg, Hamburg, Germany"><img src="https://live.staticflickr.com/1870/43628571765_27534c8e63_w.jpg" width="400" height="203" alt="Glossy Jungfernstieg, Hamburg, Germany"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>

                                                *Berlin*
<a data-flickr-embed="true" href="https://www.flickr.com/photos/danczw/40557928022/in/album-72157649772585816/" title="Behind the Brandenburger Tor, Berlin, Germany"><img src="https://live.staticflickr.com/4625/40557928022_59d7564f6a_w.jpg" width="400" height="218" alt="Behind the Brandenburger Tor, Berlin, Germany"></a><script async src="//embedr.flickr.com/assets/client-code.js" charset="utf-8"></script>


### Data
Two different kinds of data are needed for the comparison.
1. **City neighborhood and respective geographical data:** in order to analyse the cities on a meaningful level, they need to be divided into different areas, e.g. neighborhoods or boroughs. Luckily, both are available in a GitHub repository, which can be found [here](https://github.com/zauberware/postal-codes-json-xml-csv). This data includes 11 columns, of which only state (i.e. Berlin or Hamburg), zipcode (e.g. 10*** for Berlin, 22*** for Hamburg), latitude and longitude per zipcode will be needed. Zipcodes will be used for dividing both cities into smaller areas. Zipcode is chosen as it is a unique identifier. In the following, this data is cleaned and filtered to what is needed.

2. **Venue data:** Up to the first 100 venues per zipcode in both Hamburg and Berlin are retrieved in order to cluster the different areas. This data, including the venue name, its category, latitude and longitude, is gathered using the Foursquare API.
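
The filtering described in item 1 can be sketched on a tiny hand-made table. The rows below are illustrative stand-ins rather than the real dataset (only the column names follow the source file), and `zipcodes_for_states` is a hypothetical helper. It keeps the four needed columns, restricts the rows to the requested states and keeps one row per zipcode.

```python
import pandas as pd

# Illustrative rows only (not the real dataset); column names follow the source CSV.
sample = pd.DataFrame({
    "zipcode":   [10115, 10117, 20095, 20095, 80331],
    "state":     ["Berlin", "Berlin", "Hamburg", "Hamburg", "Bayern"],
    "latitude":  [52.5323, 52.5170, 53.5500, 53.5500, 48.1370],
    "longitude": [13.3846, 13.3872, 10.0000, 10.0000, 11.5750],
})

def zipcodes_for_states(df, states):
    """Keep one row per zipcode for the requested states (hypothetical helper)."""
    wanted = ["zipcode", "state", "latitude", "longitude"]
    subset = df.loc[df["state"].isin(states), wanted]
    return subset.drop_duplicates(subset="zipcode").reset_index(drop=True)

areas = zipcodes_for_states(sample, ["Berlin", "Hamburg"])
print(areas)
```

Dropping duplicates matters when the file lists the same zipcode more than once, for example for several places that share a code.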

In [18]:
# importing standard libraries
import pandas as pd
import numpy as np

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt

# !conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if folium is not installed
import folium # map rendering library

# import k-means from clustering stage
from sklearn.cluster import KMeans

# library to handle requests
import requests

# !conda install -c conda-forge geopy --yes # uncomment this line if geopy is not installed
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
print('Import done!')

Import done!


In [40]:
# import csv with all german postal codes
# source: https://github.com/zauberware/postal-codes-json-xml-csv/blob/master/data/DE/zipcodes.de.csv - by simonfranzen on github
import types
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='ID',
    ibm_auth_endpoint="https://iam.eu-de.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.eu-geo.objectstorage.service.networklayer.com')

body = client.get_object(Bucket='id',Key='zipcodes.de.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

zipcode_Ger = pd.read_csv(body)
print(f'Shape of basic data frame with {zipcode_Ger.shape[0]} rows (i.e. zipcodes) and {zipcode_Ger.shape[1]} columns, including:')
print(zipcode_Ger.dtypes)

Shape of basic data frame with 16481 rows (i.e. zipcodes) and 11 columns, including:
country_code       object
zipcode             int64
place              object
state              object
state_code         object
province           object
province_code       int64
community          object
community_code      int64
latitude          float64
longitude         float64
dtype: object
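
The cell above reads the file from IBM Cloud Object Storage. Outside that environment, a plain `pd.read_csv` on a local copy of `zipcodes.de.csv` may be simpler. The sketch below assumes the column layout printed above; `load_zipcodes` is a hypothetical helper, and the two made-up CSV lines stand in for the real file so that the snippet runs offline.

```python
import io
import pandas as pd

def load_zipcodes(source):
    """Read the postal-code CSV from a path or file-like object (hypothetical helper).

    Zipcodes are read as strings so that leading zeros (e.g. 01067) survive.
    """
    return pd.read_csv(source, dtype={"zipcode": str})

# Two made-up lines in the same column layout, used here instead of the real file.
sample_csv = io.StringIO(
    "country_code,zipcode,place,state,state_code,province,province_code,"
    "community,community_code,latitude,longitude\n"
    "DE,01067,Dresden,Sachsen,SN,,0,Dresden,14612,51.06,13.72\n"
    "DE,10115,Berlin,Berlin,BE,,0,Berlin,11000,52.5323,13.3846\n"
)
zipcodes = load_zipcodes(sample_csv)
print(zipcodes[["zipcode", "state", "latitude", "longitude"]])
```

Reading the zipcode as a string is a design choice that differs from the `int64` column shown above. It makes no difference for Berlin and Hamburg, whose codes do not start with a zero, but it avoids surprises if other states are added later.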


In [44]:
#drop columns not needed and keep all zipcodes for the state of Berlin
zipcode_Ber = zipcode_Ger[['zipcode', 'state', 'latitude','longitude']].copy()
zipcode_Ber = zipcode_Ber[zipcode_Ber.state == 'Berlin']
zipcode_Ber.reset_index(drop = True, inplace = True)

print(f'Shape: {zipcode_Ber.shape}')

#drop columns not needed and keep all zipcodes for the state of Hamburg
zipcode_Ha = zipcode_Ger[['zipcode', 'state', 'latitude','longitude']].copy()
zipcode_Ha = zipcode_Ha[zipcode_Ha.state == 'Hamburg']
zipcode_Ha.drop_duplicates(subset=['zipcode'], keep='first', inplace = True)
zipcode_Ha.reset_index(drop = True, inplace = True)
print(f'Shape: {zipcode_Ha.shape}')

#combine both data frames for Berlin and Hamburg
frames = [zipcode_Ber, zipcode_Ha]
zipcode_BerHa = pd.concat(frames)
zipcode_BerHa.reset_index(drop = True, inplace = True)
print(f'Shape of filtered data frame including all zipcodes for Berlin and Hamburg: {zipcode_BerHa.shape}')
zipcode_BerHa.head()

Shape: (195, 4)
Shape: (101, 4)
Shape of filtered data frame including all zipcodes for Berlin and Hamburg: (296, 4)


| | zipcode | state | latitude | longitude |
|---|---|---|---|---|
| 0 | 10115 | Berlin | 52.5323 | 13.3846 |
| 1 | 10117 | Berlin | 52.517 | 13.3872 |
| 2 | 10119 | Berlin | 52.5305 | 13.4053 |
| 3 | 10178 | Berlin | 52.5213 | 13.4096 |
| 4 | 10179 | Berlin | 52.5122 | 13.4164 |


In [17]:
#utilizing the Foursquare API to explore the neighborhoods and segment them
#first, defining Foursquare Credentials and Version
CLIENT_ID = 'your ID' # your Foursquare ID
CLIENT_SECRET = 'your secret' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET: ' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: your ID
CLIENT_SECRET: your secret


In [45]:
#function to retrieve up to 100 venues per zipcode in Berlin and Hamburg
LIMIT = 100

def getNearbyVenues(zipcodes, states, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for zipcode, state, lat, lng in zip(zipcodes, states, latitudes, longitudes):
        print(zipcode)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        # (a KeyError on 'groups' here most likely means the API rejected the request, e.g. because of invalid credentials)
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            zipcode, 
            state,
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    # flatten the per-zipcode lists into one row per venue
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Zipcode',
                  'State',
                  'Zipcode Latitude', 
                  'Zipcode Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return nearby_venues

#run the above function on each zipcode and create a new dataframe called venues_BerHa
venues_BerHa = getNearbyVenues(zipcodes = zipcode_BerHa['zipcode'],
                                   states = zipcode_BerHa['state'],
                                   latitudes = zipcode_BerHa['latitude'],
                                   longitudes = zipcode_BerHa['longitude']
                                  )

10115


KeyError: 'groups'
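
The `KeyError: 'groups'` above means that the JSON reply had no `groups` entry under `response`, which is what one would expect when the request is rejected (for example with placeholder credentials). A minimal sketch of a more defensive parser follows. `extract_items` is a hypothetical helper, the `meta`/`errorDetail` fields are an assumption about the shape of an error reply, and both payloads are hand-written stand-ins rather than real API responses.

```python
def extract_items(payload):
    """Return the venue items from an explore-style reply, or [] if none are present."""
    groups = payload.get("response", {}).get("groups", [])
    if not groups:
        # Surface whatever error detail the reply carries instead of raising KeyError.
        detail = payload.get("meta", {}).get("errorDetail", "no detail given")
        print(f"No venues returned: {detail}")
        return []
    return groups[0].get("items", [])

# Hand-written stand-ins shaped like the fields the notebook code accesses.
ok_reply = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Cafe",
               "location": {"lat": 52.53, "lng": 13.38},
               "categories": [{"name": "Cafe"}]}}]}]}}
rejected_reply = {"meta": {"code": 400, "errorDetail": "example error text"},
                  "response": {}}

print(len(extract_items(ok_reply)))
print(len(extract_items(rejected_reply)))
```

Returning an empty list lets a loop over many zipcodes continue while still printing why a particular request produced nothing.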

In [None]:
#Venues in Berlin and Hamburg data frame overview
print(f'Shape: {venues_BerHa.shape}')
venues_BerHa.head()

In [None]:
#number of venues returned for each zipcode in Berlin and Hamburg
venues_BerHa.groupby(['Zipcode'],as_index = False).count().head()

In [None]:
#analyzing each zipcode
#one hot encoding
onehot_BerHa = pd.get_dummies(venues_BerHa[['Venue Category']], prefix="", prefix_sep="")

# add zipcode column back to dataframe
onehot_BerHa['Zipcode'] = venues_BerHa['Zipcode'] 

# move zipcode column to the first column
fixed_columns = [onehot_BerHa.columns[-1]] + list(onehot_BerHa.columns[:-1])
onehot_BerHa = onehot_BerHa[fixed_columns]

print(f'Shape: {onehot_BerHa.shape}')
onehot_BerHa.head()

In [None]:
#group rows by zipcode and take the mean of the frequency of occurrence of each category
grouped_BerHa = onehot_BerHa.groupby('Zipcode').mean().reset_index()
print(f'Shape: {grouped_BerHa.shape}')
grouped_BerHa
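
To make the one-hot and group-by-mean steps concrete, here is a small worked example on made-up venues (the zipcodes and categories are illustrative only). Averaging the 0/1 indicator columns per zipcode gives the share of each category among that zipcode's venues, so every row of the result sums to 1.

```python
import pandas as pd

# Made-up venues for two zipcodes, using the notebook's column names.
toy = pd.DataFrame({
    "Zipcode": [10115, 10115, 10115, 10115, 20095, 20095],
    "Venue Category": ["Cafe", "Cafe", "Bar", "Park", "Bar", "Harbor / Marina"],
})

# One indicator column per category, then put the zipcode in front.
onehot = pd.get_dummies(toy["Venue Category"], dtype=float)
onehot.insert(0, "Zipcode", toy["Zipcode"])

# Mean of 0/1 indicators = share of each category among a zipcode's venues.
profile = onehot.groupby("Zipcode").mean()
print(profile)
```

In this toy table, half of the venues in 10115 are cafes, so its `Cafe` share is 0.5.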

In [None]:
#First write a function to sort the venues in descending order
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

#create the new dataframe and display the top 10 venues for each zipcode
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Zipcode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        # ranks beyond the third use the 'th' suffix
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
zipcode_BerHa_venues_sorted = pd.DataFrame(columns=columns)
zipcode_BerHa_venues_sorted['Zipcode'] = grouped_BerHa['Zipcode']

for ind in np.arange(grouped_BerHa.shape[0]):
    zipcode_BerHa_venues_sorted.iloc[ind, 1:] = return_most_common_venues(grouped_BerHa.iloc[ind, :], num_top_venues)

print(f'Shape: {zipcode_BerHa_venues_sorted.shape}')
zipcode_BerHa_venues_sorted.sort_values(by=['Zipcode'], ascending=True)
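
The ranking step can be illustrated on a single made-up frequency row. The sketch below is an assumption-laden variant rather than the notebook's own code: `top_categories` and `ordinal` are hypothetical helpers, and `ordinal` builds the column suffix without relying on an exception, which also keeps labels such as 11th or 21st correct if more than ten ranks are ever requested.

```python
import pandas as pd

def top_categories(profile_row, n):
    """Return the n category names with the highest share, highest first."""
    return profile_row.sort_values(ascending=False).index[:n].tolist()

def ordinal(k):
    """Format a positive integer rank as 1st, 2nd, 3rd, 4th, ..."""
    if 10 <= k % 100 <= 20:
        return f"{k}th"
    suffix = {1: "st", 2: "nd", 3: "rd"}.get(k % 10, "th")
    return f"{k}{suffix}"

# Made-up category shares for one zipcode (no ties, so the order is unambiguous).
row = pd.Series({"Bar": 0.3, "Cafe": 0.5, "Park": 0.2})

labels = [f"{ordinal(k)} Most Common Venue" for k in range(1, 3)]
print(dict(zip(labels, top_categories(row, 2))))
```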

In [148]:
#Use geopy library to get the latitude and longitude values of Berlin and Hamburg

address_Ber = 'Berlin'

geolocator_Ber = Nominatim(user_agent="ber_explorer")
location_Ber = geolocator_Ber.geocode(address_Ber)
latitude_Ber = location_Ber.latitude
longitude_Ber = location_Ber.longitude
print(f'The geographical coordinates of Berlin are {latitude_Ber}, {longitude_Ber}.')


address_Ha = 'Hamburg'

geolocator_Ha = Nominatim(user_agent="ha_explorer")
location_Ha = geolocator_Ha.geocode(address_Ha)
latitude_Ha = location_Ha.latitude
longitude_Ha = location_Ha.longitude
print(f'The geographical coordinates of Hamburg are {latitude_Ha}, {longitude_Ha}.')

The geographical coordinates of Berlin are 52.5170365, 13.3888599.
The geographical coordinates of Hamburg are 53.550341, 10.000654.


In [192]:
# create map for Berlin
map_Ber_clusters = folium.Map(location=[latitude_Ber, longitude_Ber], zoom_start=11)

kclusters = 1 #remove for clustering

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi in zip(zipcode_Ber['latitude'], zipcode_Ber['longitude'], zipcode_Ber['zipcode']): #for clustering add cluster here
    label = folium.Popup(str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        #color=rainbow[cluster-1],
        fill=True,
        #fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_Ber_clusters)
       
map_Ber_clusters

In [193]:

# create map for Hamburg
map_Ha_clusters = folium.Map(location=[latitude_Ha, longitude_Ha], zoom_start=11)

kclusters = 1 #remove for clustering

# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lat, lon, poi in zip(zipcode_Ha['latitude'], zipcode_Ha['longitude'], zipcode_Ha['zipcode']): #for clustering add cluster here
    label = folium.Popup(str(poi), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        #color=rainbow[cluster-1],
        fill=True,
        #fill_color=rainbow[cluster-1],
        fill_opacity=0.7).add_to(map_Ha_clusters)
       
map_Ha_clusters
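
The maps above still use `kclusters = 1` and leave the cluster colouring commented out. One way the clustering step could look, using the `KMeans` class imported at the top, is sketched below on a made-up frequency table. The values, the choice of two clusters and the column name `Cluster` are assumptions for illustration only, not results of this analysis.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up category shares per zipcode (rows sum to 1), standing in for grouped_BerHa.
grouped = pd.DataFrame({
    "Zipcode": [10115, 10117, 20095, 20097],
    "Bar":     [0.60, 0.55, 0.10, 0.05],
    "Cafe":    [0.30, 0.35, 0.10, 0.15],
    "Harbor":  [0.10, 0.10, 0.80, 0.80],
})

kclusters = 2  # assumption; in practice this would need to be tuned
features = grouped.drop(columns="Zipcode")
kmeans = KMeans(n_clusters=kclusters, n_init=10, random_state=0).fit(features)

# Attach the labels so zipcodes from both cities can be compared cluster by cluster.
grouped["Cluster"] = kmeans.labels_
print(grouped[["Zipcode", "Cluster"]])
```

Because both cities are clustered together on the same category columns, a Berlin zipcode and a Hamburg zipcode that end up with the same label can be read as similar areas in the sense of this analysis. The numeric label itself carries no meaning beyond group membership.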