<h1>Applied Data Science Capstone Project</h1>

<p>This notebook was put together for the IBM Data Science certificate capstone project. This first version will describe the business problem of the capstone and related data to solve problem.</p>

<h3>The business idea</h3>
<p>I have a friend who is considering on moving to the Capital area in Finland who has never lived in the area and is unaware of the neighbourhoods. Coming from rural areas he's a mainly concerned about about finding a location to live where: a) The pricing is not too expensive b) the available public transport to his new workplace in Otaniemi would be relative fast (45 minutes max by public transport) c) the place is not too restless and unsafe d) there would be parks, gardens, cafeterias etc for unwinding on free time.</p>

<p>As the capital area is relatively large there are a lot of different neighbourhoods to choose from. There are certain "bad" neighbourhoods where the crimes are more common. These neighbourhoods usually also are either located in the transit hubs or have more bars etc in the areas. The speed of the public transportation also varies quite a bit depending on how many transit changes are required and how close the start and end points are to the main hubs. While the prices in the whole capital area are quite high in relation to other areas there is also a lot of difference across the neighbourhoods. The more quiet areas with more parks and less density of population are usually also quite far in terms of public transport.</p>

<p>It would be interesting to have a service that would recommend neighbourhoods based on this criteria when planning to purchase an apartment. Personally I spent around 8 months hunting for an apartment and trying to figure out the neighbourhoods. </p>

<p>Since the easiest way to find locations for neighbourhoods in FourSquare and other data sources includes postal codes we will mainly focus on different postal code areas. The actual code areas can include several neighbourhoods inside the area but this will be the precision we're aiming for in this project.</p>

<h3>The data</h3>
<p>Solving this problem requires quite a few separate services to obtain the data. The main types of data that we will need to solve the problem are: location data with different venues, dataset or a service to build the data set from the distances and estimated travel, average prices data for the neighbourhoods.</p>

<p><b>Venue location data</b> will come from FourSquare the data will be used to evaluate the "restlessness" or how lively the neighbourhood is and to obtain the most common types of venues and trying to make some recommendations based on the most desired venue types.</p>

<p>We can also obtain some <b>living related data</b> arranged by postal codes from Statistics Finland. The dataset and its features can be found <a href="https://www.stat.fi/tup/paavo/tietosisalto_ja_esimerkit.html">here</a>. The data includes median income, unemployment and the living density</p>

<p><b>The average prices data</b> will be acquired from Statistics Finland and the obtained csv will have to be wrangled a bit to produce the average price of the apartments. The data will be used for ranking the neighbourhoods.</p>

<p><b>The public transport data</b> will be acquired from HSL which is the public transport provider for the whole capital area. The collected data will be used to rank the neighbourhoods based on the speed of the transport and to restrict neighbourhoods that are too far from the home to work transit time requirements.</p>

<p>To visualise the mentioned data on the map we will also need to obtain the geojson borders of the neighbourhoods in the areas. We can obtain the data from HSY Web Features service at <a href="https://kartta.hsy.fi/geoserver/wfs">https://kartta.hsy.fi/geoserver/wfs.</a></p>


First lets import all the dependencies.

In [963]:
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim
import folium
import requests
from sklearn.cluster import KMeans
import json

import matplotlib.cm as cm
import matplotlib.colors as colors
import branca.colormap as cm

import geopandas as gp

import pyproj
from pyproj import CRS

from python_graphql_client import GraphqlClient

<b>1. The locations</b>

To begin solving the problem lets print out the map of the Capital to visualise the area we are discussing about. We can use geopys Notinatim API to to first acquire the location of the mainarea and the workplace and the we can use Folium to print the area on map and mark the workplace there.

In [145]:
# Use a neighbourhood a bit north of the actual city center to center the map on more on the mainland
address_kannelmaki = 'Kannelmäki, Helsinki, Finland'

geolocator = Nominatim(user_agent="helsinki_explorer")
location_kannelmaki = geolocator.geocode(address_kannelmaki)
latitude_kannelmaki = location_kannelmaki.latitude
longitude_kannelmaki = location_kannelmaki.longitude
print('The geograpical coordinates of {} is latitude: {} and longitude: {}.'.format(address_kannelmaki, latitude_kannelmaki, longitude_kannelmaki))

# Use a neighbourhood a bit north of the actual city center to center the map on more on the mainland
address_workplace = 'Innopoli 3, Espoo, Finland'

geolocator = Nominatim(user_agent="helsinki_explorer")
location_workplace = geolocator.geocode(address_workplace)
latitude_workplace = location_workplace.latitude
longitude_workplace = location_workplace.longitude
print('The geograpical coordinates of {} is latitude: {} and longitude: {}.'.format(address_workplace, latitude_workplace, longitude_workplace))


The geograpical coordinates of Kannelmäki, Helsinki, Finland is latitude: 60.2436076 and longitude: 24.8832893.
The geograpical coordinates of Innopoli 3, Espoo, Finland is latitude: 60.1881158 and longitude: 24.80870540096568.


In [148]:
# Generate at new folium map from latitude and longitude values
map_uusimaa = folium.Map(location=[latitude_kannelmaki, longitude_kannelmaki], zoom_start=10)

# Mark the workplace on the map
folium.CircleMarker(
        [latitude_workplace, longitude_workplace],
        radius=5,
        popup='Workplace',
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_uusimaa) 

map_uusimaa

<b>2. Postal code areas, the neighbourhoods and the corresponding geojson</b>

You can read more about geojson on <a href="https://en.wikipedia.org/wiki/GeoJSON">wikipedia</a> but basically the geojson is just standard for representing geographical features. The geojson can be of several types like points, polygons, lines etc. For this project we are interested in using geojson polygons to render the boundaries of each postal area.

The geojson can contain some extra properties, f.ex here it contains the neighbourhood names, the postal codes etc. The geojson also contains the type of the geometry and the corresponding coordinates.

Next lets load the geojson for neighbourhoods in the capital region. The geojson has been attained from HSY's web features service and converted into geojson. The data has been included in the repository so we can just load it. We use geopandas to load the data. 

In [659]:
# Read the file with geopandas
neighbourhoods = gp.read_file('postinro_alue.geojson')

# Print out first five rows of the data for inspection
neighbourhoods.head()

Unnamed: 0,gml_id,posno,toimip,toimip_ru,nimi,nimi_ru,kunta,kunta_nro,geometry
0,pks_postinumeroalueet_2020.1,100,HELSINKI,HELSINGFORS,Helsinki Keskusta - Etu-Töölö,Helsingfors centrum - Främre Tölö,Helsinki,91,"MULTIPOLYGON (((25495415.010 6673755.420, 2549..."
1,pks_postinumeroalueet_2020.2,120,HELSINKI,HELSINGFORS,Punavuori,Rödbergen,Helsinki,91,"MULTIPOLYGON (((25496720.730 6672703.770, 2549..."
2,pks_postinumeroalueet_2020.3,130,HELSINKI,HELSINGFORS,Kaartinkaupunki,Gardesstaden,Helsinki,91,"MULTIPOLYGON (((25496776.230 6672752.055, 2549..."
3,pks_postinumeroalueet_2020.4,140,HELSINKI,HELSINGFORS,Kaivopuisto - Ullanlinna,Brunnsparken - Ulrikasborg,Helsinki,91,"MULTIPOLYGON (((25497132.180 6672015.420, 2549..."
4,pks_postinumeroalueet_2020.5,150,HELSINKI,HELSINGFORS,Eira - Hernesaari,Eira - Ärtholmen,Helsinki,91,"MULTIPOLYGON (((25496970.120 6671136.315, 2549..."


From the geometry we can also see that the data is in EPSG:3879 format. To use the geometries in Folium we need to project the geometries to EPSG:4326.

In [660]:
# Define the crs of the dataframe
neighbourhoods.crs = CRS.from_epsg(3879)

# Project to EPSG:4326
neighbourhoods = neighbourhoods.to_crs(epsg=4326)

# Print the first five rows for inspection
neighbourhoods.head()

Unnamed: 0,gml_id,posno,toimip,toimip_ru,nimi,nimi_ru,kunta,kunta_nro,geometry
0,pks_postinumeroalueet_2020.1,100,HELSINKI,HELSINGFORS,Helsinki Keskusta - Etu-Töölö,Helsingfors centrum - Främre Tölö,Helsinki,91,"MULTIPOLYGON (((24.91739 60.17664, 24.91766 60..."
1,pks_postinumeroalueet_2020.2,120,HELSINKI,HELSINGFORS,Punavuori,Rödbergen,Helsinki,91,"MULTIPOLYGON (((24.94093 60.16721, 24.94107 60..."
2,pks_postinumeroalueet_2020.3,130,HELSINKI,HELSINGFORS,Kaartinkaupunki,Gardesstaden,Helsinki,91,"MULTIPOLYGON (((24.94193 60.16764, 24.95107 60..."
3,pks_postinumeroalueet_2020.4,140,HELSINKI,HELSINGFORS,Kaivopuisto - Ullanlinna,Brunnsparken - Ulrikasborg,Helsinki,91,"MULTIPOLYGON (((24.94835 60.16103, 24.94846 60..."
4,pks_postinumeroalueet_2020.5,150,HELSINKI,HELSINGFORS,Eira - Hernesaari,Eira - Ärtholmen,Helsinki,91,"MULTIPOLYGON (((24.94545 60.15314, 24.94237 60..."


Lets format the data frame a bit and also translate the names into english.

In [661]:
# We can drop some columns we don't need
neighbourhoods = neighbourhoods.drop(labels=['toimip_ru', 'nimi_ru', 'gml_id', 'kunta_nro', 'toimip'], axis = 1)

# Lets also rename and translate the columns
neighbourhoods.columns = ['PostalCode', 'Neighbourhood', 'Municipality', 'geometry']

# Lets print out the dataframe to see the applied changes
neighbourhoods.head()

Unnamed: 0,PostalCode,Neighbourhood,Municipality,geometry
0,100,Helsinki Keskusta - Etu-Töölö,Helsinki,"MULTIPOLYGON (((24.91739 60.17664, 24.91766 60..."
1,120,Punavuori,Helsinki,"MULTIPOLYGON (((24.94093 60.16721, 24.94107 60..."
2,130,Kaartinkaupunki,Helsinki,"MULTIPOLYGON (((24.94193 60.16764, 24.95107 60..."
3,140,Kaivopuisto - Ullanlinna,Helsinki,"MULTIPOLYGON (((24.94835 60.16103, 24.94846 60..."
4,150,Eira - Hernesaari,Helsinki,"MULTIPOLYGON (((24.94545 60.15314, 24.94237 60..."


Examine the dimensions of the dataframe.

In [662]:
neighbourhoods.shape

(172, 4)

<b>3. Transit data via HSL</b>

To calculate the distances to different postal code areas we need to access the HSL GraphQL API. The documentation for the API is available <a href="https://digitransit.fi/en/developers/apis/1-routing-api/x-advanced/">here</a>. We will use python_graphql_client library for querying the API. The postal code areas can be somewhat large and differ between the actual apartment destination, but we will just cut some corners and estimate based on whatever coordinates the Nominatim library provides for the postal code.

Lets first setup the client and the query for loading the transit data. Since we are estimating the time to workplace then lets assume that we want to get there around the rush hour. Lets plan the routes to start on monday at eight in the morning.

In [236]:
# Setup endpoint to Helsinki area graphql api
endpoint = "https://api.digitransit.fi/routing/v1/routers/hsl/index/graphql"
# Setup the client
client = GraphqlClient(endpoint=endpoint)

# Define the query
query = """
    query planQuery($start: InputCoordinates, $end: InputCoordinates) {
        plan(
            from: $start
            to: $end
            date: "2021-06-07"
            time: "08:00:00"
        ) {
            itineraries {
                duration
            }
        }
    }
"""

Then lets define some test location and use it in the query to validate that the query is working and to check how the data looks.

In [259]:
# Get the location for the first postcode in the dataframe
address_hsy_test = '00100, Finland'

geolocator = Nominatim(user_agent="helsinki_explorer")
location_hsy_test = geolocator.geocode(address_hsy_test)
latitude_hsy_test = location_hsy_test.latitude
longitude_hsy_test = location_hsy_test.longitude

# Setup some test variables
variables = { "start": {"lat": latitude_hsy_test, "lon": longitude_hsy_test}, "end": {"lat": latitude_workplace, "lon": longitude_workplace}}

# Execute the query
data = client.execute(query=query, variables=variables)

routes = data['data']['plan']['itineraries'];

# Print the data for inspection
print(routes)

[{'duration': 1979}, {'duration': 1979}, {'duration': 1919}]


Looking at the data we can see that for the plan it returns several different route options with varying durations defined in seconds.

In [266]:
# Check the shortest duration, floor to minutes
min(map(lambda route: route['duration'], routes)) // 60

31

<p>
Now lets form the durations dataframe. We will take the postal codes from the neighbourhoods dataframe and use it for looping over the postal codes. Then we will get the postal code using Nominatim and then use the location for querying the data from HSY and append the postal code, lon, lat and the duration into a dataframe.
</p>
<p>
Since the loading will take a long while we will also later save the data into a csv for safety. Also we are somewhat in the mercy of the HSY API so will check the routes that cannot be found and just ignore those postal codes for this project. To avoid the routes not being found the position of the geospatial coordinates could be more intelligent and not just a "guess".
</p>


In [270]:
# Create new dataframe for durations data
durations = pd.DataFrame(columns=['PostalCode', 'Duration', 'Longitude', 'Latitude'])

# Create a list from the data for looping
codes = neighbourhoods['PostalCode']

for code in codes:
    # Get the geospatial coordinates for the postal code
    address = code + ', Finland'
    geolocator = Nominatim(user_agent="helsinki_explorer")
    hsy_location = geolocator.geocode(address)
    hsy_latitude = hsy_location.latitude
    hsy_longitude = hsy_location.longitude
    
    # Setup variables for the query
    variables = { "start": {"lat": hsy_latitude, "lon": hsy_longitude}, "end": {"lat": latitude_workplace, "lon": longitude_workplace}}

    # Execute the query
    data = client.execute(query=query, variables=variables)
    
    # Extract the itineraries
    routes = data['data']['plan']['itineraries']
    if (len(routes) == 0):
        print('Unable to find routes for code', code)
        continue
        
    # Calculate the minimum duration and convert to minutes
    duration = min(map(lambda route: route['duration'], routes)) // 60
    
    # Append to dataframe
    durations = durations.append(
        { 'PostalCode': code, 'Duration': duration, 'Latitude': hsy_latitude, 'Longitude': hsy_longitude },
        ignore_index=True)
    
# Inspect the data
durations.head()

Unable to find routes for code 00310




Unable to find routes for code 01800
Unable to find routes for code 02290




Unable to find routes for code 02980


Unnamed: 0,PostalCode,Duration,Longitude,Latitude
0,100,31,24.933727,60.169989
1,120,42,24.939202,60.163562
2,130,44,24.947547,60.165009
3,140,52,24.952425,60.158122
4,150,44,24.938014,60.158939


In [272]:
# Add the dataframe for safety, in case the queries fail you could just load the data from the repo
# durations.to_csv('postinro_durations.csv', index=False)
# durations = pd.read_csv('postinro_durations.csv')

Seems like the API could not find routes for a few of the routes. Lets investigate a bit. To see how much data we got in comparison to the original post codes data. We will ignore the postal codes for this project.

In [274]:
# Check the obtained durations data dimensions
print(durations.shape)
# Check the neighbourhoods data dimensions
print(neighbourhoods.shape)

(168, 4)
(172, 4)


<b>4. Average apartment prices data</b>

The price data has been exported from Statistics Finland service. The prices includes apartment prices for 2020 for all types and ages of apartments. The last row for each postal code includes the average of all types and ages of apartments sold weighed by the amount sales.

First lets check how the average prices csv looks.

In [309]:
# Open the file
with open('asuntojen_hinnat_2020_statfi_utf8.csv') as f:
    # Read the first 5 rows of the data
    head = [next(f) for x in range(5)]

    # Print out the head
    for row in head:
        print(row)

112q -- Vanhojen osakeasuntojen keskihinnat ja kauppojen lukumäärät postinumeroalueittain ja rakennusvuosittain, 2010-2020;;;

;;;

Postinumero;Talotyyppi;Rakennusvuosi;2020 Neliöhinta (EUR/m2)

00100 Helsinki Keskusta - Etu-Töölö   (Helsinki );Kerrostalo yksiöt;-1949;..

00100 Helsinki Keskusta - Etu-Töölö   (Helsinki );Kerrostalo yksiöt;1950-1959;..



Looking at the rows we can see that the resulting csv is not very clean so first we have to load the csv and clean it a bit to be able to store it into a pandas dataframe. There is quite many rows for each of the postal codes and we're only interested in the last row so we will use a dict for containing the postal codes and the data and loop over the data so that the final value for each postal code will be the last row.

In [310]:
# Prepare the prices dict for collecting the price data
prices = {}

# Open the prices file
with open('asuntojen_hinnat_2020_statfi_utf8.csv') as f:
    lines = f.readlines()
    # Loop over the lines skipping the 3 header rows
    for line in lines[3:]:
        # Split the csv row
        row = line.split(';')

        # Since rows with no sales per type / age of the apartment is marked as '..' we can just ignore the row
        if('..' in row[-1]):
            continue

        # Extract the postal code from the first item in the row
        postal_code = row[0].split(' ')[0]

        # Set the price for a postal on the dict (only last value will be remembered for each postal code area)
        prices[postal_code] = row[-1].replace('\n', '')

# Initialise the dataframe
prices_data = pd.DataFrame(columns=['PostalCode', 'AveragePrice'])

# Collect the data into the data frame
for code in prices.keys():
    prices_data = prices_data.append(
        { 'PostalCode': code, 'AveragePrice': prices[code] },
        ignore_index=True)
    
# Print the first five rows of the data
prices_data.head()

Unnamed: 0,PostalCode,AveragePrice
0,100,7587
1,120,8182
2,130,7855
3,140,8712
4,150,8401


<b>5. Living indicators based on postal code</b>

Next we will prepare the more generic living related data obtained from Statistics Finland. The data set has been downloaded from the website. We will first investigate the data set a bit and then perform some operation on the data set to clean it up a bit and drop column we don't need and also drop rows with no data.

In [889]:
# Read the data from the csv file
indicators_data = pd.read_csv('tilastokeskus_postinumero_tietoa_utf8.csv', skiprows=2, sep=';')
# Print head of the data set for investigation
indicators_data.head()

Unnamed: 0,Postinumeroalue,Postinumeroalueen pinta-ala,"Asukkaat yhteensä, 2018 (HE)","Asukkaiden keskitulot, 2017 (HR)","Asukkaiden mediaanitulot, 2017 (HR)","Asumisväljyys, 2018 (TE)","Taloudet yhteensä, 2017 (TR)","Työttömät, 2017 (PT)"
0,00100 Helsinki Keskusta - Etu-Töölö (Helsinki),2353278,18427,42196,27577,38.7,10205,702
1,00120 Punavuori (Helsinki),414010,7161,41657,27523,39.5,3933,273
2,00130 Kaartinkaupunki (Helsinki),428960,1523,57766,30479,43.0,818,41
3,00140 Kaivopuisto - Ullanlinna (Helsinki),931841,7921,53555,29439,41.3,4404,261
4,00150 Eira - Hernesaari (Helsinki),1367328,9385,41564,26546,34.3,5759,438


In [890]:
# Print the dimensions of the data
indicators_data.shape

(199, 8)

Lets first translate and rename the columns.

In [891]:
# Rename columns
indicators_data.columns = ['PostalCode', 'AreaSize', 'Population', 'AverageIncome', 'MedianIncome', 'LivingSpace', 'Households', 'Unemployed']
# Examine the changes
indicators_data.head()

Unnamed: 0,PostalCode,AreaSize,Population,AverageIncome,MedianIncome,LivingSpace,Households,Unemployed
0,00100 Helsinki Keskusta - Etu-Töölö (Helsinki),2353278,18427,42196,27577,38.7,10205,702
1,00120 Punavuori (Helsinki),414010,7161,41657,27523,39.5,3933,273
2,00130 Kaartinkaupunki (Helsinki),428960,1523,57766,30479,43.0,818,41
3,00140 Kaivopuisto - Ullanlinna (Helsinki),931841,7921,53555,29439,41.3,4404,261
4,00150 Eira - Hernesaari (Helsinki),1367328,9385,41564,26546,34.3,5759,438


Now that we have the data translated we can see that we have some data that we don't really need. Lets select the columns we will use.

In [892]:
# Perform selection and drop the unneeded columns
indicators_data = indicators_data[['PostalCode', 'Population', 'MedianIncome', 'LivingSpace', 'Unemployed']]
# Examine data
indicators_data.head()

Unnamed: 0,PostalCode,Population,MedianIncome,LivingSpace,Unemployed
0,00100 Helsinki Keskusta - Etu-Töölö (Helsinki),18427,27577,38.7,702
1,00120 Punavuori (Helsinki),7161,27523,39.5,273
2,00130 Kaartinkaupunki (Helsinki),1523,30479,43.0,41
3,00140 Kaivopuisto - Ullanlinna (Helsinki),7921,29439,41.3,261
4,00150 Eira - Hernesaari (Helsinki),9385,26546,34.3,438


The postal code column seems to contain some junk text, lets format the cell contents a bit to be later able to merge the data set with other data.

In [893]:
# Use regex and clean up the postal codes
indicators_data['PostalCode'] = indicators_data['PostalCode'].replace(' .*', '', regex=True)
# View the results
indicators_data.head()

Unnamed: 0,PostalCode,Population,MedianIncome,LivingSpace,Unemployed
0,100,18427,27577,38.7,702
1,120,7161,27523,39.5,273
2,130,1523,30479,43.0,41
3,140,7921,29439,41.3,261
4,150,9385,26546,34.3,438


The data also contains some incomplete data that are represented with '..' so lets first convert those strings into nan and then drop the nan rows.

In [894]:
# Replace .. strings with nan
indicators_data = indicators_data.replace('\.\.', np.nan, regex=True)
# Drop nan rows
indicators_data = indicators_data.dropna()
# Check the results
indicators_data.head()

Unnamed: 0,PostalCode,Population,MedianIncome,LivingSpace,Unemployed
0,100,18427,27577,38.7,702
1,120,7161,27523,39.5,273
2,130,1523,30479,43.0,41
3,140,7921,29439,41.3,261
4,150,9385,26546,34.3,438


In [895]:
# Check the dimensions of the data
indicators_data.shape

(198, 5)

Looks like there was one row with nan data that was dropped. Next lets use the population and the unemployed count to create a new column with unemployment rate.

In [896]:
# Create unemployment rate column from the population count and the unemployed count
indicators_data['UnemploymentRate'] = indicators_data['Unemployed'].astype('int') / indicators_data['Population'].astype('int')
indicators_data.head()

Unnamed: 0,PostalCode,Population,MedianIncome,LivingSpace,Unemployed,UnemploymentRate
0,100,18427,27577,38.7,702,0.038096
1,120,7161,27523,39.5,273,0.038123
2,130,1523,30479,43.0,41,0.026921
3,140,7921,29439,41.3,261,0.03295
4,150,9385,26546,34.3,438,0.04667


Lets drop the population and unemployed counts now that we have the rates

In [897]:
# Drop population and unemployed columns
indicators_data.drop(labels=['Population', 'Unemployed'], axis=1, inplace=True)
# Check the results
indicators_data.head()

Unnamed: 0,PostalCode,MedianIncome,LivingSpace,UnemploymentRate
0,100,27577,38.7,0.038096
1,120,27523,39.5,0.038123
2,130,30479,43.0,0.026921
3,140,29439,41.3,0.03295
4,150,26546,34.3,0.04667


Lets also check the data types and convert MedianIncome to integer.

In [898]:
# Check the types
indicators_data.dtypes

PostalCode           object
MedianIncome         object
LivingSpace         float64
UnemploymentRate    float64
dtype: object

In [899]:
# Cast the MedianIncome into int
indicators_data['MedianIncome'] = indicators_data['MedianIncome'].astype(int)
# Check the types again
indicators_data.dtypes

PostalCode           object
MedianIncome          int64
LivingSpace         float64
UnemploymentRate    float64
dtype: object

<b>6. Venues data from FourSquare</b>

The last source of data we will be using will be FourSquare. We will use the FourSquare API for fetching nearby venues data. The data will be used in clustering the postal code areas and to evaluate how well the postal code area would fit the requirements of the person considering to live in the area.

First lets setup the credentials for accessing the data.

In [476]:
# Removed from github for safety
CLIENT_ID = ''
CLIENT_SECRET = ''
VERSION = '20180605'
LIMIT = 100

Then lets borrow some functions from the course for fetching the venue data.

In [468]:
def getNearbyVenues(codes, latitudes, longitudes, radius=500):
    venues_list=[]
    for code, lat, lng in zip(codes, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            code, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['PostalCode', 
                  'Latitude', 
                  'Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

Lets fetch the venues

In [469]:
# Call the function to fetch venues
capital_venues = getNearbyVenues(
    names = durations['PostalCode'], 
    latitudes = durations['Latitude'], 
    longitudes = durations['Longitude'])

In [470]:
# Check the data
capital_venues.head()

Unnamed: 0,PostalCode,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,100,60.169989,24.933727,Cafe Rouge,60.168711,24.933027,Middle Eastern Restaurant
1,100,60.169989,24.933727,Pobre,60.1695,24.933484,Filipino Restaurant
2,100,60.169989,24.933727,Amos Rex,60.170643,24.936529,Art Museum
3,100,60.169989,24.933727,Futurice,60.168766,24.9343,IT Services
4,100,60.169989,24.933727,Helsingin Astanga joogakoulu,60.168128,24.936061,Yoga Studio


In [471]:
# Check dimensions of the data
capital_venues.shape

(2438, 7)

Now that we have the data we can use one hot encoding and calculate the frequencies of the venues on the postal code area.

In [472]:
# one hot encoding
capital_onehot = pd.get_dummies(capital_venues[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
capital_onehot['PostalCode'] = capital_venues['PostalCode'] 

# move neighborhood column to the first column
fixed_columns = [capital_onehot.columns[-1]] + list(capital_onehot.columns[:-1])
capital_onehot = capital_onehot[fixed_columns]

# Check the data
capital_onehot.head()

Unnamed: 0,PostalCode,ATM,Airport Terminal,American Restaurant,Antique Shop,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,...,Vietnamese Restaurant,Warehouse Store,Water Park,Waterfront,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,100,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,100,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,100,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,100,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,100,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


Lets calculate the frequencies of the data

In [473]:
capital_grouped = capital_onehot.groupby('PostalCode').mean().reset_index()
capital_grouped.head()

Unnamed: 0,PostalCode,ATM,Airport Terminal,American Restaurant,Antique Shop,Art Gallery,Art Museum,Arts & Crafts Store,Asian Restaurant,Athletics & Sports,...,Vietnamese Restaurant,Warehouse Store,Water Park,Waterfront,Wine Bar,Wine Shop,Wings Joint,Women's Store,Yoga Studio,Zoo
0,100,0.0,0.0,0.0,0.0,0.02,0.03,0.01,0.01,0.0,...,0.01,0.0,0.0,0.0,0.03,0.01,0.0,0.0,0.01,0.0
1,120,0.0,0.0,0.033898,0.0,0.0,0.016949,0.0,0.0,0.0,...,0.033898,0.0,0.0,0.0,0.016949,0.0,0.0,0.0,0.0,0.0
2,130,0.0,0.0,0.025,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,140,0.0,0.0,0.0,0.022222,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.022222,0.0,0.0,0.0,0.0,0.0,0.0
4,150,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.025641,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now that we have the frequency data we can use it to sort the venues per postal code to see most common types of venues in the postal code area.

In [475]:
# Define number of the most common venues to list
num_top_venues = 5

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['PostalCode']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['PostalCode'] = capital_grouped['PostalCode']

for ind in np.arange(capital_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(capital_grouped.iloc[ind, :], num_top_venues)

# Check the resulting data
neighborhoods_venues_sorted.head()

Unnamed: 0,PostalCode,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,100,Scandinavian Restaurant,Sushi Restaurant,Rock Club,Chinese Restaurant,Art Museum
1,120,Scandinavian Restaurant,Sushi Restaurant,Cocktail Bar,Pizza Place,Grocery Store
2,130,Scandinavian Restaurant,Pizza Place,Coffee Shop,Hotel,Park
3,140,Park,Coffee Shop,Ice Cream Shop,Grocery Store,Scandinavian Restaurant
4,150,Pizza Place,Coffee Shop,Scandinavian Restaurant,Bakery,French Restaurant


<h3>Methodology</h3>

In this chapter we will use the acquired data to first visually inspect the data using various methods such as the creating different choropleth maps from data to explore the data and drive our main investigation.

The course of the analysis goes as follows
- Visually inspect the postal code areas on map
    - Include the various properties on each of the postal code areas
- Visualise the distances using the choropleth map based on the expected public transport duration
- Visualise the average apartment prices on the map
- Visualise the some of the other living factors on map
- Filter out areas that are not within the maximum defined public transport duration
- Filter out areas that are too expensive
- Explore the other properties
- Using k-means cluster the remaining data

<b>1. Visualise public transport durations data on the map</b>

Now we have all the data lets start to reduce the number of postal code areas to find the most interesting ones. And also visualise the data on the to drive our decisions.

In [998]:
# Merge the neighbourhoods with the travel durations data
durations_merged = neighbourhoods.merge(durations, how="right", on="PostalCode")

In [999]:
# Setup the linear color map for the choropleth
colormap = cm.linear.YlOrRd_08.to_step(data=durations_merged['Duration'], index=[10, 20, 30, 40, 50, 60, 70, 80, 90])

# Generate at new folium map from latitude and longitude values
map_uusimaa = folium.Map(location=[latitude_kannelmaki, longitude_kannelmaki], zoom_start=10)

# Setup tile layer
folium.TileLayer('CartoDB positron',name="Light Map",control=False).add_to(map_uusimaa)

# Add caption
colormap.caption = "Estimated travel duration in minutes"

# Setup styling function
style_function = lambda x: {"weight":0.5, 
                            'color':'black',
                            'fillColor': 'white' if pd.isna(x['properties']['Duration']) else colormap(x['properties']['Duration']), 
                            'nan_fill_color': 'white',
                            'fillOpacity':0.7}

# Setup highlight function
highlight_function = lambda x: {'fillColor': '#000000', 
                                'color':'#000000', 
                                'fillOpacity': 0.50, 
                                'weight': 0.1}

# Define the actual map
GeoJSON =folium.features.GeoJson(
        durations_merged,
        style_function=style_function,
        control=False,
        highlight_function=highlight_function,
        tooltip=folium.features.GeoJsonTooltip(fields=['PostalCode', 'Neighbourhood', 'Duration'],
            aliases=['Postal code', 'Neighbourhood', 'Estimated transit duration in minutes'],
            style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;"),
            sticky=True
        )
    )

# Add the map items to the map
colormap.add_to(map_uusimaa)
map_uusimaa.add_child(GeoJSON)

# Mark the workplace on the map
folium.CircleMarker(
        [latitude_workplace, longitude_workplace],
        radius=5,
        popup='Workplace',
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_uusimaa) 

map_uusimaa

<p>The work place is a bit off the usual routes for public transport. In the Otaniemi and Tapiola regions there are metro stops, but those are within a short walking distance from the office. The nearest train station is in Leppävaara and one needs to take a bus from the station to reach office. The for some of the closer areas there are also direct busses but the main transport from further away is using a metro or the train. Having to switch several transportation methods will obviously affect the duration which we can also see on the map.</p>

<p>The fastest areas are the ones where there is direct bus access. We can see that the neighbouring postal code areas are the fastest. Then also the ones along some main bus routes following Kehä 1 road, the road going to meilahti via Lehtisaari-Kuusisaari region and the road south to Westend.</p>

<p>Next thing we can observe is the differences between the metro and train stops compared to their neighbouring postal code areas without the metro or train stops. For example we can follow the metro line to east through the city center and observe how much faster it is to reach those postal code areas versus the postal code areas that require several transit changes.</p>

<p>We can immediately drop the areas that are too far in terms of transport duration to reduce the areas to consider for purchasing an apartment in.</p>

In [1033]:
# Check the data dimensions before dropping data
durations_merged.shape

(168, 7)

In [1000]:
# Drop the areas from the data set where duration is 45 minutes or more.
durations_filtered = durations_merged[durations_merged['Duration'] < 45]
# Check the dimensions
durations_filtered.shape

(70, 7)

Restricting the areas by travel duration made already quite significant reduction in the areas to consider. Lets re-center the map a bit and zoom in closer. Lets also re-draw the map to visualise the remaining areas.

In [None]:
address = '02130, Espoo, Finland'

geolocator = Nominatim(user_agent="helsinki_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude

In [1008]:
# colormap = cm.linear.YlGnBu_09.to_step(data=filtered_data_indicators['LivingSpace'], method='quant', quantiles=[0,0.1,0.75,0.9,0.98,1])
colormap = cm.linear.YlOrRd_08.to_step(data=durations_merged['Duration'], index=[10, 20, 30, 40, 50, 60, 70, 80, 90])

# Generate at new folium map from latitude and longitude values
map_uusimaa = folium.Map(location=[latitude, longitude], zoom_start=11)

folium.TileLayer('CartoDB positron',name="Light Map",control=False).add_to(map_uusimaa)
colormap.caption = "Estimated travel duration in minutes"

style_function = lambda x: {"weight":0.5, 
                            'color':'black',
                            'fillColor': 'white' if pd.isna(x['properties']['Duration']) else colormap(x['properties']['Duration']), 
                            'nan_fill_color': 'white',
                            'fillOpacity':0.7}

highlight_function = lambda x: {'fillColor': '#000000', 
                                'color':'#000000', 
                                'fillOpacity': 0.50, 
                                'weight': 0.1}

GeoJSON =folium.features.GeoJson(
        durations_filtered,
        style_function=style_function,
        control=False,
        highlight_function=highlight_function,
        tooltip=folium.features.GeoJsonTooltip(fields=['PostalCode', 'Neighbourhood', 'Duration'],
            aliases=['Postal code', 'Neighbourhood', 'Estimated transit duration in minutes'],
            style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;"),
            sticky=True
        )
    )

colormap.add_to(map_uusimaa)

map_uusimaa.add_child(GeoJSON)

# Mark the workplace on the map
folium.CircleMarker(
        [latitude_workplace, longitude_workplace],
        radius=5,
        popup='Workplace',
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_uusimaa) 

map_uusimaa

Next lets merge the prices data to the data set and visualise the average apartment prices on the left over map.

In [1003]:
# Merge the prices to the data set
prices_data_with_durations = durations_filtered.merge(prices_data, how="left", on="PostalCode")
# Check the data
prices_data_with_durations.head()

Unnamed: 0,PostalCode,Neighbourhood,Municipality,geometry,Duration,Longitude,Latitude,AveragePrice
0,100,Helsinki Keskusta - Etu-Töölö,Helsinki,"MULTIPOLYGON (((24.91739 60.17664, 24.91766 60...",31,24.933727,60.169989,7587
1,120,Punavuori,Helsinki,"MULTIPOLYGON (((24.94093 60.16721, 24.94107 60...",42,24.939202,60.163562,8182
2,130,Kaartinkaupunki,Helsinki,"MULTIPOLYGON (((24.94193 60.16764, 24.95107 60...",44,24.947547,60.165009,7855
3,150,Eira - Hernesaari,Helsinki,"MULTIPOLYGON (((24.94545 60.15314, 24.94237 60...",44,24.938014,60.158939,8401
4,170,Kruununhaka,Helsinki,"MULTIPOLYGON (((24.94344 60.17773, 24.94935 60...",42,24.955321,60.171866,7854


In [1004]:
# Check the dimensions
prices_data_with_durations.shape

(70, 8)

In [1005]:
# The resulting data contains some nan values, lets drop the rows
prices_data_with_durations = prices_data_with_durations.dropna()
# Check the dimensions again
prices_data_with_durations.shape

(67, 8)

In [1006]:
# Lets convert the average price to int
prices_data_with_durations['AveragePrice'] = prices_data_with_durations['AveragePrice'].astype(int)
prices_data_with_durations.dtypes

PostalCode         object
Neighbourhood      object
Municipality       object
geometry         geometry
Duration           object
Longitude         float64
Latitude          float64
AveragePrice        int64
dtype: object

Now with having done that we can visualise average prices on the map.

In [1012]:
colormap = cm.linear.YlOrRd_08.to_step(data=prices_data_with_durations['AveragePrice'], index=[2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000])

# Generate at new folium map from latitude and longitude values
map_uusimaa = folium.Map(location=[latitude, longitude], zoom_start=11)

folium.TileLayer('CartoDB positron',name="Light Map",control=False).add_to(map_uusimaa)
colormap.caption = "Average apartment prices for postal areas"

style_function = lambda x: {"weight":0.5, 
                            'color':'black',
                            'fillColor': 'white' if pd.isna(x['properties']['AveragePrice']) else colormap(x['properties']['AveragePrice']), 
                            'nan_fill_color': 'white',
                            'fillOpacity':0.7}

highlight_function = lambda x: {'fillColor': '#000000', 
                                'color':'#000000', 
                                'fillOpacity': 0.50, 
                                'weight': 0.1}

GeoJSON =folium.features.GeoJson(
        prices_data_with_durations,
        style_function=style_function,
        control=False,
        highlight_function=highlight_function,
        tooltip=folium.features.GeoJsonTooltip(fields=['PostalCode', 'Neighbourhood', 'AveragePrice'],
            aliases=['Postal code', 'Neighbourhood', 'Average apartment prices for postal areas'],
            style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;"),
            sticky=True
        )
    )

colormap.add_to(map_uusimaa)

map_uusimaa.add_child(GeoJSON)

# Mark the workplace on the map
folium.CircleMarker(
        [latitude_workplace, longitude_workplace],
        radius=5,
        popup='Workplace',
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_uusimaa) 

map_uusimaa

We can see that on the remaining data set there are some quite expensive postal code areas mainly around the center of Helsinki. Since those are too expensive neighbourhoods for us lets drop the areas where average price is more than 5000 euros per square meter.

In [1013]:
# Drop the areas with average price over 5000e
prices_data_with_durations_filtered = prices_data_with_durations[prices_data_with_durations['AveragePrice'] < 5000]
# Check the dimensions
prices_data_with_durations_filtered.shape

(42, 8)

We can see again quite significant reduction in areas so lets re-render the map again to see the changes.

In [1014]:
# colormap = cm.linear.YlGnBu_09.to_step(data=filtered_data_indicators['LivingSpace'], method='quant', quantiles=[0,0.1,0.75,0.9,0.98,1])
colormap = cm.linear.YlOrRd_08.to_step(data=prices_data_with_durations_filtered['AveragePrice'], index=[2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000])

# Generate at new folium map from latitude and longitude values
map_uusimaa = folium.Map(location=[latitude, longitude], zoom_start=11)

folium.TileLayer('CartoDB positron',name="Light Map",control=False).add_to(map_uusimaa)
colormap.caption = "Average apartment prices for postal areas"

style_function = lambda x: {"weight":0.5, 
                            'color':'black',
                            'fillColor': 'white' if pd.isna(x['properties']['AveragePrice']) else colormap(x['properties']['AveragePrice']), 
                            'nan_fill_color': 'white',
                            'fillOpacity':0.7}

highlight_function = lambda x: {'fillColor': '#000000', 
                                'color':'#000000', 
                                'fillOpacity': 0.50, 
                                'weight': 0.1}

GeoJSON =folium.features.GeoJson(
        prices_data_with_durations_filtered,
        style_function=style_function,
        control=False,
        highlight_function=highlight_function,
        tooltip=folium.features.GeoJsonTooltip(fields=['PostalCode', 'Neighbourhood', 'AveragePrice'],
            aliases=['Postal code', 'Neighbourhood', 'Average apartment prices for postal areas'],
            style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;"),
            sticky=True
        )
    )

colormap.add_to(map_uusimaa)

map_uusimaa.add_child(GeoJSON)

# Mark the workplace on the map
folium.CircleMarker(
        [latitude_workplace, longitude_workplace],
        radius=5,
        popup='Workplace',
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_uusimaa) 

map_uusimaa

We can see that we have now more affordable areas left. The original neighbourhoods data is already quite well reduced. Lets try to evaluate some other interesting living related indicators and try to reduce the remaining data set further.

In [1015]:
# Merge the indicators data to the data set
filtered_data_indicators = prices_data_with_durations_filtered.merge(indicators_data, how="left", on="PostalCode")
# Check the new data
filtered_data_indicators.head()

Unnamed: 0,PostalCode,Neighbourhood,Municipality,geometry,Duration,Longitude,Latitude,AveragePrice,MedianIncome,LivingSpace,UnemploymentRate
0,340,Kuusisaari-Lehtisaari,Helsinki,"MULTIPOLYGON (((24.86319 60.17061, 24.86321 60...",16,24.856462,60.18357,4243,33289,53.0,0.028297
1,350,Munkkivuori-Niemenmäki,Helsinki,"MULTIPOLYGON (((24.88529 60.21411, 24.88553 60...",28,24.871192,60.205799,4995,24178,31.9,0.047845
2,360,Pajamäki,Helsinki,"MULTIPOLYGON (((24.86102 60.21374, 24.86042 60...",35,24.852486,60.216928,4585,23575,30.8,0.063169
3,370,Reimarla,Helsinki,"MULTIPOLYGON (((24.83489 60.21853, 24.83377 60...",23,24.853725,60.224847,3496,23334,33.2,0.053331
4,390,Konala,Helsinki,"MULTIPOLYGON (((24.85757 60.23095, 24.85722 60...",39,24.848279,60.240347,3520,24086,33.4,0.05745


We have some new data on the set. Lets quickly go over them. We now have median income, the average living space in square meters and the unemployment rates of the postal code areas. We can visualise these indicators on the map and try to limit the data set a bit further along our preferences.

In [1016]:
# Check the data dimensions
filtered_data_indicators.shape

(42, 11)

Lets begin by making the employment rate more easily understandable by multiplying it by 100

In [1018]:
# Multiply unemployment rates by 100
filtered_data_indicators['UnemploymentRate'] = filtered_data_indicators['UnemploymentRate'] * 100
# Check the data
filtered_data_indicators.head()

Unnamed: 0,PostalCode,Neighbourhood,Municipality,geometry,Duration,Longitude,Latitude,AveragePrice,MedianIncome,LivingSpace,UnemploymentRate
0,340,Kuusisaari-Lehtisaari,Helsinki,"MULTIPOLYGON (((24.86319 60.17061, 24.86321 60...",16,24.856462,60.18357,4243,33289,53.0,2.829655
1,350,Munkkivuori-Niemenmäki,Helsinki,"MULTIPOLYGON (((24.88529 60.21411, 24.88553 60...",28,24.871192,60.205799,4995,24178,31.9,4.784483
2,360,Pajamäki,Helsinki,"MULTIPOLYGON (((24.86102 60.21374, 24.86042 60...",35,24.852486,60.216928,4585,23575,30.8,6.316916
3,370,Reimarla,Helsinki,"MULTIPOLYGON (((24.83489 60.21853, 24.83377 60...",23,24.853725,60.224847,3496,23334,33.2,5.33313
4,390,Konala,Helsinki,"MULTIPOLYGON (((24.85757 60.23095, 24.85722 60...",39,24.848279,60.240347,3520,24086,33.4,5.745008


Lets start evaluating the indicators by visualising the unemployment rates of the postal code areas on the map

In [1020]:
colormap = cm.linear.YlOrRd_08.to_step(data=filtered_data_indicators['UnemploymentRate'], index=[0, 2, 4, 6, 8, 10])

# Generate at new folium map from latitude and longitude values
map_uusimaa = folium.Map(location=[latitude, longitude], zoom_start=11)

folium.TileLayer('CartoDB positron',name="Light Map",control=False).add_to(map_uusimaa)
colormap.caption = "Unemployment rates for the postal areas"

style_function = lambda x: {"weight":0.5, 
                            'color':'black',
                            'fillColor': 'white' if pd.isna(x['properties']['UnemploymentRate']) else colormap(x['properties']['UnemploymentRate']), 
                            'nan_fill_color': 'white',
                            'fillOpacity':0.7}

highlight_function = lambda x: {'fillColor': '#000000', 
                                'color':'#000000', 
                                'fillOpacity': 0.50, 
                                'weight': 0.1}

GeoJSON =folium.features.GeoJson(
        filtered_data_indicators,
        style_function=style_function,
        control=False,
        highlight_function=highlight_function,
        tooltip=folium.features.GeoJsonTooltip(fields=['PostalCode', 'Neighbourhood', 'UnemploymentRate'],
            aliases=['Postal code', 'Neighbourhood', 'Unemployment rate'],
            style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;"),
            sticky=True
        )
    )

colormap.add_to(map_uusimaa)

map_uusimaa.add_child(GeoJSON)

# Mark the workplace on the map
folium.CircleMarker(
        [latitude_workplace, longitude_workplace],
        radius=5,
        popup='Workplace',
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_uusimaa) 

map_uusimaa

There are some differences in the unemployment rates over the areas. We can see that Otaniemi and Westend have quite small percentage of unemployed residents. Otaniemi is quite oddity in most of the statistics because it has very large population of students on the area due to a large technical university campus. Westend is also one of the wealthier areas in the capital region. There are a few areas with larger unemployment rates than in the rest of the areas. These are Pajamäki, Pähkinärinne, Maunula-Suursuo and Kannelmäki. Lets drop the areas with unemployment rates over six percent from the data set.

In [1021]:
# Drop the areas from the data set where unemployment rates are over six percent
filtered_data_indicators = filtered_data_indicators[filtered_data_indicators['UnemploymentRate'] < 6]
# Check the dimensions of the data
filtered_data_indicators.shape

(38, 11)

Lets check the changes on the map too.

In [1022]:
colormap = cm.linear.YlOrRd_08.to_step(data=filtered_data_indicators['UnemploymentRate'], index=[0, 2, 4, 6, 8, 10])

# Generate at new folium map from latitude and longitude values
map_uusimaa = folium.Map(location=[latitude, longitude], zoom_start=11)

folium.TileLayer('CartoDB positron',name="Light Map",control=False).add_to(map_uusimaa)
colormap.caption = "Unemployment rates for the postal areas"

style_function = lambda x: {"weight":0.5, 
                            'color':'black',
                            'fillColor': 'white' if pd.isna(x['properties']['UnemploymentRate']) else colormap(x['properties']['UnemploymentRate']), 
                            'nan_fill_color': 'white',
                            'fillOpacity':0.7}

highlight_function = lambda x: {'fillColor': '#000000', 
                                'color':'#000000', 
                                'fillOpacity': 0.50, 
                                'weight': 0.1}

GeoJSON =folium.features.GeoJson(
        filtered_data_indicators,
        style_function=style_function,
        control=False,
        highlight_function=highlight_function,
        tooltip=folium.features.GeoJsonTooltip(fields=['PostalCode', 'Neighbourhood', 'UnemploymentRate'],
            aliases=['Postal code', 'Neighbourhood', 'Unemployment rate'],
            style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;"),
            sticky=True
        )
    )

colormap.add_to(map_uusimaa)

map_uusimaa.add_child(GeoJSON)

# Mark the workplace on the map
folium.CircleMarker(
        [latitude_workplace, longitude_workplace],
        radius=5,
        popup='Workplace',
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_uusimaa) 

map_uusimaa

Next we have the median income of the postal code areas. Lets visualise the data for further investigation.

In [1025]:
colormap = cm.linear.RdYlBu_09.to_step(data=filtered_data_indicators['MedianIncome'], index=[10000, 15000, 20000, 25000, 30000, 35000, 40000])

# Generate at new folium map from latitude and longitude values
map_uusimaa = folium.Map(location=[latitude, longitude], zoom_start=11)

folium.TileLayer('CartoDB positron',name="Light Map",control=False).add_to(map_uusimaa)
colormap.caption = "Median income for the postal areas"

style_function = lambda x: {"weight":0.5, 
                            'color':'black',
                            'fillColor': 'white' if pd.isna(x['properties']['MedianIncome']) else colormap(x['properties']['MedianIncome']), 
                            'nan_fill_color': 'white',
                            'fillOpacity':0.7}

highlight_function = lambda x: {'fillColor': '#000000', 
                                'color':'#000000', 
                                'fillOpacity': 0.50, 
                                'weight': 0.1}

GeoJSON =folium.features.GeoJson(
        filtered_data_indicators,
        style_function=style_function,
        control=False,
        highlight_function=highlight_function,
        tooltip=folium.features.GeoJsonTooltip(fields=['PostalCode', 'Neighbourhood', 'MedianIncome'],
            aliases=['Postal code', 'Neighbourhood', 'Median income'],
            style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;"),
            sticky=True
        )
    )

colormap.add_to(map_uusimaa)

map_uusimaa.add_child(GeoJSON)

# Mark the workplace on the map
folium.CircleMarker(
        [latitude_workplace, longitude_workplace],
        radius=5,
        popup='Workplace',
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_uusimaa) 

map_uusimaa

We can again see that there are some areas which are quite different from the others. Again its the combination of Westend and Otaniemi. We can see that Otaniemi has very low median income due to its large number of students. We can also see Westend having the largest median income. Since the income of the area could indicate some restlesness in the neighbourhood lets drop areas that have median income of 15000 euros or less.

In [1026]:
# Drop areas that have median income of 15000 euros or less
filtered_data_indicators = filtered_data_indicators[filtered_data_indicators['MedianIncome'] > 15000]
# Check the dimensions
filtered_data_indicators.shape

(37, 11)

Lets re-render the map for inspecting changes

In [1027]:
colormap = cm.linear.RdYlBu_09.to_step(data=filtered_data_indicators['MedianIncome'], index=[10000, 15000, 20000, 25000, 30000, 35000, 40000])

# Generate at new folium map from latitude and longitude values
map_uusimaa = folium.Map(location=[latitude, longitude], zoom_start=11)

folium.TileLayer('CartoDB positron',name="Light Map",control=False).add_to(map_uusimaa)
colormap.caption = "Median income for the postal areas"

style_function = lambda x: {"weight":0.5, 
                            'color':'black',
                            'fillColor': 'white' if pd.isna(x['properties']['MedianIncome']) else colormap(x['properties']['MedianIncome']), 
                            'nan_fill_color': 'white',
                            'fillOpacity':0.7}

highlight_function = lambda x: {'fillColor': '#000000', 
                                'color':'#000000', 
                                'fillOpacity': 0.50, 
                                'weight': 0.1}

GeoJSON =folium.features.GeoJson(
        filtered_data_indicators,
        style_function=style_function,
        control=False,
        highlight_function=highlight_function,
        tooltip=folium.features.GeoJsonTooltip(fields=['PostalCode', 'Neighbourhood', 'MedianIncome'],
            aliases=['Postal code', 'Neighbourhood', 'Median income'],
            style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;"),
            sticky=True
        )
    )

colormap.add_to(map_uusimaa)

map_uusimaa.add_child(GeoJSON)

# Mark the workplace on the map
folium.CircleMarker(
        [latitude_workplace, longitude_workplace],
        radius=5,
        popup='Workplace',
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_uusimaa) 

map_uusimaa

We can see that we only removed the Otaniemi region from the data set. Lets continue by checking the average living space in square meters on the map.

In [1028]:
colormap = cm.linear.RdYlBu_09.to_step(data=filtered_data_indicators['LivingSpace'], index=[20, 25, 30, 35, 40, 45, 50, 55])

# Generate at new folium map from latitude and longitude values
map_uusimaa = folium.Map(location=[latitude, longitude], zoom_start=11)

folium.TileLayer('CartoDB positron',name="Light Map",control=False).add_to(map_uusimaa)
colormap.caption = "The living space in m2 per postal areas"

style_function = lambda x: {"weight":0.5, 
                            'color':'black',
                            'fillColor': 'white' if pd.isna(x['properties']['LivingSpace']) else colormap(x['properties']['LivingSpace']), 
                            'nan_fill_color': 'white',
                            'fillOpacity':0.7}

highlight_function = lambda x: {'fillColor': '#000000', 
                                'color':'#000000', 
                                'fillOpacity': 0.50, 
                                'weight': 0.1}

GeoJSON =folium.features.GeoJson(
        filtered_data_indicators,
        style_function=style_function,
        control=False,
        highlight_function=highlight_function,
        tooltip=folium.features.GeoJsonTooltip(fields=['PostalCode', 'Neighbourhood', 'LivingSpace'],
            aliases=['Postal code', 'Neighbourhood', 'The living space in m2'],
            style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;"),
            sticky=True
        )
    )

colormap.add_to(map_uusimaa)

map_uusimaa.add_child(GeoJSON)

# Mark the workplace on the map
folium.CircleMarker(
        [latitude_workplace, longitude_workplace],
        radius=5,
        popup='Workplace',
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_uusimaa) 

map_uusimaa

<p>There seem to be some more dense areas with less living area. These seem to be mostly in the transit hubs and on the in the Helsinki area.</p>
<p>Since we are looking for spacious areas outside the hassle of the capitals center we can also assume that we care about having more space so lets remove the areas where average living space is 35 square meters or less.</p>

In [1031]:
# Remove areas where the average living space is 35 square meters or less
filtered_data_indicators = filtered_data_indicators[filtered_data_indicators['LivingSpace'] > 35]
# Check the dimensions of the remaining data
filtered_data_indicators.shape

(20, 11)

Lets once again visualise the changes on map.

In [1032]:
colormap = cm.linear.RdYlBu_09.to_step(data=filtered_data_indicators['LivingSpace'], index=[20, 25, 30, 35, 40, 45, 50, 55])

# Generate at new folium map from latitude and longitude values
map_uusimaa = folium.Map(location=[latitude, longitude], zoom_start=11)

folium.TileLayer('CartoDB positron',name="Light Map",control=False).add_to(map_uusimaa)
colormap.caption = "The living space in m2 per postal areas"

style_function = lambda x: {"weight":0.5, 
                            'color':'black',
                            'fillColor': 'white' if pd.isna(x['properties']['LivingSpace']) else colormap(x['properties']['LivingSpace']), 
                            'nan_fill_color': 'white',
                            'fillOpacity':0.7}

highlight_function = lambda x: {'fillColor': '#000000', 
                                'color':'#000000', 
                                'fillOpacity': 0.50, 
                                'weight': 0.1}

GeoJSON =folium.features.GeoJson(
        filtered_data_indicators,
        style_function=style_function,
        control=False,
        highlight_function=highlight_function,
        tooltip=folium.features.GeoJsonTooltip(fields=['PostalCode', 'Neighbourhood', 'LivingSpace'],
            aliases=['Postal code', 'Neighbourhood', 'The living space in m2'],
            style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;"),
            sticky=True
        )
    )

colormap.add_to(map_uusimaa)

map_uusimaa.add_child(GeoJSON)

# Mark the workplace on the map
folium.CircleMarker(
        [latitude_workplace, longitude_workplace],
        radius=5,
        popup='Workplace',
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_uusimaa) 

map_uusimaa

We can see that we have reduced the original list of neighbourhoods quite a bit. There are only 20 postal code areas left.

<b>2. Examine the remaining neighbourhoods</b>

Now that we have only 20 postal code areas left in the data set we can look into the remaining data in some more detail before trying to cluster the data and try to see if we can still find way to filter the data. Lets first merge the venues data into our data set.

In [1039]:
# Merge the indicators data to the data set
neighbourhoods_data = filtered_data_indicators.merge(neighborhoods_venues_sorted, how="left", on="PostalCode")
# Print out the whole remaining data set for inspection
neighbourhoods_data

Unnamed: 0,PostalCode,Neighbourhood,Municipality,geometry,Duration,Longitude,Latitude,AveragePrice,MedianIncome,LivingSpace,UnemploymentRate,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,340,Kuusisaari-Lehtisaari,Helsinki,"MULTIPOLYGON (((24.86319 60.17061, 24.86321 60...",16,24.856462,60.18357,4243,33289,53.0,2.829655,Restaurant,Café,Pizza Place,French Restaurant,Liquor Store
1,620,Metsälä-Etelä-Oulunkylä,Helsinki,"MULTIPOLYGON (((24.92161 60.21749, 24.91464 60...",43,24.940437,60.221955,4370,24966,40.0,3.226788,Bus Stop,Pizza Place,Shopping Mall,Flea Market,Bar
2,660,Länsi-Pakila,Helsinki,"MULTIPOLYGON (((24.90764 60.23635, 24.90937 60...",43,24.928067,60.24026,3905,30035,40.3,2.798676,Soccer Field,Bus Stop,Hockey Rink,Grocery Store,ATM
3,1630,Hämeenkylä,Vantaa,"MULTIPOLYGON (((24.80685 60.26766, 24.80765 60...",44,24.794271,60.267138,3191,27762,37.4,2.797877,Bus Stop,Flea Market,Soccer Field,ATM,Paintball Field
4,1640,Hämevaara,Vantaa,"MULTIPOLYGON (((24.79256 60.25421, 24.79486 60...",30,24.802189,60.251044,3482,25673,37.8,4.906542,Pizza Place,Bus Stop,Grocery Store,Soccer Field,ATM
5,2130,Pohjois-Tapiola,Espoo,"MULTIPOLYGON (((24.78834 60.18115, 24.78870 60...",17,24.793208,60.187425,4278,30931,38.5,2.691218,Soccer Field,Insurance Office,Theater,Beach,ATM
6,2140,Laajalahti,Espoo,"MULTIPOLYGON (((24.79577 60.19624, 24.79610 60...",18,24.805547,60.20277,3539,30296,36.8,3.556993,Bus Stop,Cafeteria,Brewery,Sauna / Steam Room,Convenience Store
7,2160,Westend,Espoo,"MULTIPOLYGON (((24.83567 60.13041, 24.83515 60...",20,24.801453,60.167467,4475,37207,51.7,1.888604,Harbor / Marina,Restaurant,Bus Stop,Beach,Taxi Stand
8,2170,Haukilahti,Espoo,"MULTIPOLYGON (((24.79042 60.15631, 24.79012 60...",27,24.775274,60.15928,4524,30677,42.5,3.456181,Bus Stop,Dance Studio,Burger Joint,Yoga Studio,Pizza Place
9,2210,Olari,Espoo,"MULTIPOLYGON (((24.70285 60.18105, 24.70361 60...",36,24.72759,60.173252,3244,25598,37.1,4.295652,Bus Stop,Playground,Supermarket,Auto Dealership,Cafeteria


We can see that of the remaining data most seem to be Espoo municipality area. Three are in Helsinki and only two in Vantaa. Lets continue by sorting the data by the transit duration.

In [1041]:
neighbourhoods_data.sort_values(by='Duration').head()

Unnamed: 0,PostalCode,Neighbourhood,Municipality,geometry,Duration,Longitude,Latitude,AveragePrice,MedianIncome,LivingSpace,UnemploymentRate,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
0,340,Kuusisaari-Lehtisaari,Helsinki,"MULTIPOLYGON (((24.86319 60.17061, 24.86321 60...",16,24.856462,60.18357,4243,33289,53.0,2.829655,Restaurant,Café,Pizza Place,French Restaurant,Liquor Store
5,2130,Pohjois-Tapiola,Espoo,"MULTIPOLYGON (((24.78834 60.18115, 24.78870 60...",17,24.793208,60.187425,4278,30931,38.5,2.691218,Soccer Field,Insurance Office,Theater,Beach,ATM
6,2140,Laajalahti,Espoo,"MULTIPOLYGON (((24.79577 60.19624, 24.79610 60...",18,24.805547,60.20277,3539,30296,36.8,3.556993,Bus Stop,Cafeteria,Brewery,Sauna / Steam Room,Convenience Store
7,2160,Westend,Espoo,"MULTIPOLYGON (((24.83567 60.13041, 24.83515 60...",20,24.801453,60.167467,4475,37207,51.7,1.888604,Harbor / Marina,Restaurant,Bus Stop,Beach,Taxi Stand
8,2170,Haukilahti,Espoo,"MULTIPOLYGON (((24.79042 60.15631, 24.79012 60...",27,24.775274,60.15928,4524,30677,42.5,3.456181,Bus Stop,Dance Studio,Burger Joint,Yoga Studio,Pizza Place


<p>We can see that there are four postal code areas where the transit duration is 20 minutes or less which is very good distance wise. Most of the areas with exception of Laajalahti are a bit higher in terms of average price though. The closes area seems to reside in Helsinki but the rest seem to be in Espoo. We can also see that the unemployment rates of these closer areas seem to be in the lower end of the data set.</p>

<p>Lets try sorting the data by average price to see more about the data.</p>

In [1042]:
neighbourhoods_data.sort_values(by='AveragePrice').head()

Unnamed: 0,PostalCode,Neighbourhood,Municipality,geometry,Duration,Longitude,Latitude,AveragePrice,MedianIncome,LivingSpace,UnemploymentRate,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue
17,2710,Viherlaakso,Espoo,"MULTIPOLYGON (((24.74518 60.23624, 24.74554 60...",41,24.747626,60.233753,2771,25748,38.2,4.705684,Stables,Bus Stop,ATM,Outlet Store,Pet Store
14,2360,Soukka,Espoo,"MULTIPOLYGON (((24.68386 60.14916, 24.68365 60...",42,24.670859,60.139456,3082,24942,39.0,5.031256,Bus Stop,Skating Rink,Forest,Train Station,Grocery Store
15,2720,Lähderanta,Espoo,"MULTIPOLYGON (((24.74518 60.23624, 24.74506 60...",40,24.745892,60.240701,3126,26287,37.3,3.805073,Bus Stop,Buffet,Gym / Fitness Center,Soccer Field,Café
3,1630,Hämeenkylä,Vantaa,"MULTIPOLYGON (((24.80685 60.26766, 24.80765 60...",44,24.794271,60.267138,3191,27762,37.4,2.797877,Bus Stop,Flea Market,Soccer Field,ATM,Paintball Field
9,2210,Olari,Espoo,"MULTIPOLYGON (((24.70285 60.18105, 24.70361 60...",36,24.72759,60.173252,3244,25598,37.1,4.295652,Bus Stop,Playground,Supermarket,Auto Dealership,Cafeteria


We can immediately see a raise in the transit duration times as the average prices go up. The cheapest of the areas is Viherlaakso with an average price of 2771 euros. Four out of five cheapest areas seem to be in Espoo municipality and one being in Vantaa municipality. The unemployment rates seem to climb up on these areas with smaller average prices.

<b>3. Clustering the areas</b>

Now that we have the data set and we have inspected the data lets see if we can learn something new by clustering the neighbourhoods. We will be using k-means for clustering.

<h3>Results</h3>

<h3>Discussion</h3>

<h3>Conclusion</h3>