# Capstone Project - Finnish...

## Assignment Instructions from the course

Coursera Course: _Applied Data Science Capstone_

You have the opportunity to be as creative as you want and come up with an idea to leverage the Foursquare location data to *explore or compare neighborhoods or cities*
of your choice **or** to *come up with a problem that you can use the Foursquare location data to solve*. If you cannot think of an idea or a problem, here are some ideas to get you started:

1. In Module 3, we explored New York City and the city of Toronto and segmented and clustered their neighborhoods. Both cities are very diverse and are the financial capitals of their respective countries. One interesting idea would be to compare the neighborhoods of the two cities and determine how similar or dissimilar they are. Is New York City more like Toronto or Paris or some other multicultural city? I will leave it to you to refine this idea.

2. In a city of your choice, if someone is looking to open a restaurant, where would you recommend that they open it? Similarly, if a contractor is trying to start their own business, where would you recommend that they setup their office?

These are just a couple of many ideas and problems that can be solved using location data in addition to other datasets. No matter what you decide to do, make sure to provide sufficient justification of why you think what you want to do or solve is important and why would a client or a group of people be interested in your project.


For the first week, you will required to submit the following:

1. A description of the problem and a discussion of the background. (15 marks)
2. A description of the data and how it will be used to solve the problem. (15 marks)


For the second week, the final deliverables of the project will be:

1. A link to your Notebook on your Github repository, showing your code. (15 marks)
2. A full report consisting of all of the following components (15 marks):
    - **Introduction** where you discuss the business problem and who would be interested in this project.
    - **Data** where you describe the data that will be used to solve the problem and the source of the data.
    - **Methodology** section which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, and what machine learnings were used and why.
    - **Results** section where you discuss the results.
    - **Discussion** section where you discuss any observations you noted and any recommendations you can make based on the results.
    - **Conclusion** section where you conclude the report.
3. Your choice of a presentation or blogpost. (10 marks)


## Working notes

Wikipedia:
- Luettelo Suomen kunnista, sisältäen mm. väkiluku ja pinta-ala: https://fi.wikipedia.org/wiki/Luettelo_Suomen_kunnista
- Luettelo Suomen maakunnista, sis samat: https://fi.wikipedia.org/wiki/Suomen_maakunnat
- Luettelo Suomen postinumeroista kunnittain: https://fi.wikipedia.org/wiki/Luettelo_Suomen_postinumeroista_kunnittain


- http://tilastokeskus.fi/org/avoindata/paikkatietoaineistot.html

Tilastokeskus, Paavo:
- Tilastokeskus, Paavo kuvaus (postinumeroalueittainen avoin tieto): http://tilastokeskus.fi/tup/paavo/paavo_kuvaus_fi.pdf
- Tilastokeskuksen PX-Web-tietokannat (Paavo): http://pxnet2.stat.fi/PXWeb/pxweb/fi/Postinumeroalueittainen_avoin_tieto/
- PX-Web:ssä valittu 2019 -aineisto => 9. Kaikki tietoryhmät (päivitetty 22.1.2019) ja seuraavalla sivulla valittu kaikki postinumeroalueet ja kaikki tiedot => valinta yhteensä 314 808 taulukkosolua, lataus "puolipiste-eroteltu (otsikollinen)".  Tallennettu tiedostoon *paavo_9_koko.csv*
- from the saved file, first two rows (empty rows in the beginning of the file) were removed.
- coordinates used: **EUREF-FIN** coordinate system (**ETRS89-TM35FIN**)


- toinen lähde: http://spatialreference.org/ref/epsg/?search=finland&srtext=Search
- antaa EPSG:2393: KKJ / Finland Uniform Coordinate System
- tämäkin lähde auttoi: http://www.kolumbus.fi/eino.uikkanen/geodocsgb/ficoords.htm
- Test coordinates visually (are they in right place) here: https://suomenkartta.fi/karttakoordinaatit/


Protocol: Web Feature Service (WFS)
- In python lib OWSLib 0.17.1 here: https://geopython.github.io/OWSLib/
- returns GML files?


# Solution

- Using FourSquare to access venue data in selected areas
- Using Folium to show results on map
- Using Paavo data to get finnish postal codes and some information about each postal code area

...

## Note: clear FS ids before submission!

FS\_ prefixed parameters refer to FourSquare service and its use.  Don't share the following two parameters, even though moderate personal use such as this notebook is free of cost.

In [1]:

# CLEAR THESE FOR THE FINAL DELIVERY

FS_CLIENT_ID = 'POKB0VB5FPZYN0ER35ARSDR3QZE3CKSVZ3I1FQ4FZWEFLARL' # your Foursquare ID
FS_CLIENT_SECRET = 'YQKQR2Q530BQ21CFBNAQZ4DQF3QOLMO5D0CADDA1BQH1FPZU' # your Foursquare Secret

## Step 0 - import libraries

In [2]:
import pandas as pd # library for data analsysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
import os
import math
import requests # library to handle requests
import operator

#!conda install -c conda-forge geopy --yes 
#from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values

# libraries for displaying images
#from IPython.display import Image 
#from IPython.core.display import HTML 
    
# tranforming json file into a pandas dataframe library
#from pandas.io.json import json_normalize

#!conda install -c conda-forge folium=0.5.0 --yes
# OR
#!pip install folium
import folium # plotting library

# import k-means from clustering stage
from sklearn.cluster import KMeans
from sklearn import preprocessing

#!pip install seaborn
import seaborn as sns

# Need to install pyproj first, if possible, check this out...
#!pip install pyproj
import pyproj

import matplotlib.pyplot as plt

print('Libraries imported.')

Libraries imported.


## Step 1 - Paavo data

### Step 1.1 - Load Paavo data

In [3]:
PAAVO_FILENAME = 'paavo_9_koko_sarkain_fixed.csv'

paavo_df = None

if os.path.isfile(PAAVO_FILENAME):

    paavo_df = pd.read_csv(PAAVO_FILENAME, sep='\t', encoding='iso-8859-1')
    print("Loaded Paavo data.\nFound {} rows of data.".format(paavo_df.shape[0]))

else:
    print("Did not find data file:", PAAVO_FILENAME)
    print("Are you perhaps in a wrong working directory?  Currently you are in:", os.getcwd())
    print("Use commands like '!cd ...' or in python 'os.chdir()' to fix your working directory.")

# Show a sample of data to understand what we have
paavo_df.head()

Loaded Paavo data.
Found 3027 rows of data.


Unnamed: 0,Postinumeroalue,X-koordinaatti metreinä,Y-koordinaatti metreinä,Postinumeroalueen pinta-ala,"Asukkaat yhteensä, 2017 (HE)","Naiset, 2017 (HE)","Miehet, 2017 (HE)","Asukkaiden keski-ikä, 2017 (HE)","0-2-vuotiaat, 2017 (HE)","3-6-vuotiaat, 2017 (HE)",...,"T Kotitalouksien toiminta työnantajina; kotitalouksien eriyttämätön toiminta tavaroiden ja palveluiden tuottamiseksi omaan käyttöön, 2016 (TP)","U Kansainvälisten organisaatioiden ja toimielinten toiminta, 2016 (TP)","X Toimiala tuntematon, 2016 (TP)","Asukkaat yhteensä, 2016 (PT)","Työlliset, 2016 (PT)","Työttömät, 2016 (PT)","Lapset 0-14 -vuotiaat, 2016 (PT)","Opiskelijat, 2016 (PT)","Eläkeläiset, 2016 (PT)","Muut, 2016 (PT)"
0,KOKO MAA,429300,7084490,390813692400,5513130,2793999,2719131,42,160297,240994,...,656,343,59,5503297,2275679,355837,894178,407905,1389830,179868
1,00100 Helsinki Keskusta - Etu-Töölö (Helsinki ),385114,6672391,2353278,18284,9613,8671,41,434,521,...,0,17,1,18035,10032,856,1812,1198,3326,811
2,00120 Punavuori (Helsinki ),385614,6671378,414010,7108,3751,3357,40,183,234,...,0,0,0,7055,3872,336,817,428,1242,360
3,00130 Kaartinkaupunki (Helsinki ),386228,6671492,428960,1508,772,736,41,34,48,...,0,12,0,1522,839,61,170,104,258,90
4,00140 Kaivopuisto - Ullanlinna (Helsinki ),386410,6670742,931841,7865,4277,3588,41,224,250,...,0,1,0,7934,4218,325,929,518,1519,425


### Step 1.2 - Paavo data, exploration

Here we explore the pure Paavo data features / columns as they are available from Tilastokeskus, that is, no fixes, cleaning or data manipulation on values.

Paavo data contains data in 105 columns for 3026 postal codes plus for the whole finland as well.  Below are the different datacolumns and their values for whole Finland, Helsinki downtown and a random rural area. The data is from years 2016 and 2017, depending on the data.

In [4]:
# Create a subset dataframe to inspect data.  In the transposed dataframe:
#    - Column 0 is for whole Finland,
#    - Column 1 is for postal code 00100 (Finlands Capital, center)
#    - Column 2600 is for postal code 89840 (very rural area)
#
paavo_fin_df = paavo_df.T[[0, 1, 2600]]
paavo_fin_df.columns = ["Whole Finland", paavo_df.iloc[1,0], paavo_df.iloc[2600,0]]


#### Step 1.2.1 - Paavo data exploration - population and age groups, 2017

In [5]:
paavo_fin_df[4:28]

Unnamed: 0,Whole Finland,00100 Helsinki Keskusta - Etu-Töölö (Helsinki ),89840 Ylä-Vuokki (Suomussalmi )
"Asukkaat yhteensä, 2017 (HE)",5513130,18284,57
"Naiset, 2017 (HE)",2793999,9613,20
"Miehet, 2017 (HE)",2719131,8671,37
"Asukkaiden keski-ikä, 2017 (HE)",42,41,63
"0-2-vuotiaat, 2017 (HE)",160297,434,0
"3-6-vuotiaat, 2017 (HE)",240994,521,0
"7-12-vuotiaat, 2017 (HE)",369950,711,0
"13-15-vuotiaat, 2017 (HE)",177163,274,0
"16-17-vuotiaat, 2017 (HE)",117857,185,0
"18-19-vuotiaat, 2017 (HE)",120218,264,1


#### Step 1.2.2 - Paavo data exploration - Education, 2017

In [6]:
paavo_fin_df[28:35]

Unnamed: 0,Whole Finland,00100 Helsinki Keskusta - Etu-Töölö (Helsinki ),89840 Ylä-Vuokki (Suomussalmi )
"18 vuotta täyttäneet yhteensä, 2017 (KO)",4446869,16159,57
"Perusasteen suorittaneet, 2017 (KO)",1112261,1996,30
"Koulutetut yhteensä, 2017 (KO)",3334608,14163,27
"Ylioppilastutkinnon suorittaneet, 2017 (KO)",303230,2618,1
"Ammatillisen tutkinnon suorittaneet, 2017 (KO)",2035528,2942,24
"Alemman korkeakoulututkinnon suorittaneet, 2017 (KO)",518969,2899,2
"Ylemmän korkeakoulututkinnon suorittaneet, 2017 (KO)",476881,5704,0


#### Step 1.2.3 - Paavo data exploration - Inhabitant income, 2016

In [7]:
paavo_fin_df[35:42]

Unnamed: 0,Whole Finland,00100 Helsinki Keskusta - Etu-Töölö (Helsinki ),89840 Ylä-Vuokki (Suomussalmi )
"18 vuotta täyttäneet yhteensä, 2016 (HR)",4431392,15935,60
"Asukkaiden keskitulot, 2016 (HR)",23812,38985,16166
"Asukkaiden mediaanitulot, 2016 (HR)",20925,26642,14939
"Alimpaan tuloluokkaan kuuluvat asukkaat, 2016 (HR)",886431,2856,26
"Keskimmäiseen tuloluokkaan kuuluvat asukkaat, 2016 (HR)",2658687,6668,31
"Ylimpään tuloluokkaan kuuluvat asukkaat, 2016 (HR)",886274,6411,3
"Asukkaiden ostovoimakertymä, 2016 (HR)",105520349469,621218859,969978


#### Step 1.2.4 - Paavo data exploration - Households, 2017

In [8]:
paavo_fin_df[42:57]

Unnamed: 0,Whole Finland,00100 Helsinki Keskusta - Etu-Töölö (Helsinki ),89840 Ylä-Vuokki (Suomussalmi )
"Taloudet yhteensä, 2017 (TE)",2680077.0,10205.0,34.0
"Talouksien keskikoko, 2017 (TE)",2.0,1.8,1.7
"Asumisväljyys, 2017 (TE)",40.5,38.6,56.4
"Nuorten yksinasuvien taloudet, 2017 (TE)",291052.0,2101.0,1.0
"Lapsettomat nuorten parien taloudet, 2017 (TE)",115168.0,861.0,0.0
"Lapsitaloudet, 2017 (TE)",570112.0,1326.0,0.0
"Pienten lasten taloudet, 2017 (TE)",142781.0,400.0,0.0
"Alle kouluikäisten lasten taloudet, 2017 (TE)",278849.0,715.0,0.0
"Kouluikäisten lasten taloudet, 2017 (TE)",263490.0,541.0,0.0
"Teini-ikäisten lasten taloudet, 2017 (TE)",221106.0,373.0,0.0


#### Step 1.2.5 - Paavo data exploration - Household income, 2016

In [9]:
paavo_fin_df[57:64]

Unnamed: 0,Whole Finland,00100 Helsinki Keskusta - Etu-Töölö (Helsinki ),89840 Ylä-Vuokki (Suomussalmi )
"Taloudet yhteensä, 2016 (TR)",2654657,10042,36
"Talouksien keskitulot, 2016 (TR)",39270,61679,26975
"Talouksien mediaanitulot, 2016 (TR)",31824,38895,23598
"Alimpaan tuloluokkaan kuuluvat taloudet, 2016 (TR)",677223,1697,13
"Keskimmäiseen tuloluokkaan kuuluvat taloudet, 2016 (TR)",1500917,4123,22
"Ylimpään tuloluokkaan kuuluvat taloudet, 2016 (TR)",476517,4222,1
"Talouksien ostovoimakertymä, 2016 (TR)",104247634221,619383515,971110


#### Step 1.2.6 - Paavo data exploration - Buildings, 2017

In [10]:
paavo_fin_df[64:72]

Unnamed: 0,Whole Finland,00100 Helsinki Keskusta - Etu-Töölö (Helsinki ),89840 Ylä-Vuokki (Suomussalmi )
"Kesämökit yhteensä, 2017 (RA)",507200.0,0.0,103.0
"Rakennukset yhteensä, 2017 (RA)",1523196.0,634.0,90.0
"Muut rakennukset yhteensä, 2017 (RA)",228770.0,326.0,14.0
"Asuinrakennukset yhteensä, 2017 (RA)",1294426.0,308.0,76.0
"Asunnot, 2017 (RA)",2946814.0,11884.0,48.0
"Asuntojen keskipinta-ala, 2017 (RA)",80.1,65.9,97.7
"Pientaloasunnot, 2017 (RA)",1568029.0,2.0,48.0
"Kerrostaloasunnot, 2017 (RA)",1378785.0,11882.0,0.0


#### Step 1.2.7 - Paavo data exploration - Jobs, 2016

In [11]:
paavo_fin_df[72:98]

Unnamed: 0,Whole Finland,00100 Helsinki Keskusta - Etu-Töölö (Helsinki ),89840 Ylä-Vuokki (Suomussalmi )
"Työpaikat yhteensä, 2016 (TP)",2094313,48470,16
"Alkutuotannon työpaikat, 2016 (TP)",56104,104,0
"Jalostuksen työpaikat, 2016 (TP)",461153,1805,0
"Palveluiden työpaikat, 2016 (TP)",1576997,46560,16
"A Maatalous, metsätalous ja kalatalous, 2016 (TP)",56104,104,0
"B Kaivostoiminta ja louhinta, 2016 (TP)",5283,0,0
"C Teollisuus, 2016 (TP)",283209,752,0
"D Sähkö-, kaasu- ja lämpöhuolto, jäähdytysliiketoiminta, 2016 (TP)",11714,554,0
"E Vesihuolto, viemäri- ja jätevesihuolto ja muu ympäristön puhtaanapito, 2016 (TP)",10703,1,0
"F Rakentaminen, 2016 (TP)",150244,498,0


#### Step 1.2.8 - Paavo data exploration - Main type of activity, 2016

In [12]:
paavo_fin_df[98:]

Unnamed: 0,Whole Finland,00100 Helsinki Keskusta - Etu-Töölö (Helsinki ),89840 Ylä-Vuokki (Suomussalmi )
"Asukkaat yhteensä, 2016 (PT)",5503297,18035,61
"Työlliset, 2016 (PT)",2275679,10032,13
"Työttömät, 2016 (PT)",355837,856,6
"Lapset 0-14 -vuotiaat, 2016 (PT)",894178,1812,0
"Opiskelijat, 2016 (PT)",407905,1198,1
"Eläkeläiset, 2016 (PT)",1389830,3326,40
"Muut, 2016 (PT)",179868,811,1


### Step 1.3 - Clean data

Paavo contains ".." values (two dots) in certain columns for such postal code areas where there are less than 30 cases in total for the data section.  Here cleaning means removing such rows from data so that we have only numeric values in columns other than the first column. 

Also, drop the first row of data which contains totals for whole Finland.

But first, define some helper functions.

In [13]:
# Helper functions to analyze / report data values.

# Find out how many of each value there are in a list.
# Takes in a list and returns a dictionary, whose keys are list values, and values are # of occurrences on list
def find_uniques_with_counts(l, key_counts={}):
    for list_value in l:
        list_value = str(list_value)
        if list_value in key_counts.keys():
            key_counts[list_value] = key_counts[list_value] + 1
        else:
            key_counts[list_value] = 1
    return key_counts

def print_uniques_with_counts_dict(key_counts, print_threshold=0, total_values=0):
    for i in (range(len(key_counts.keys()))):
        max_key = max(key_counts.items(), key=operator.itemgetter(1))[0]
        if total_values == 0:
            total_percentage = ""
        else:
            total_percentage = int(100*(key_counts[max_key]/total_values))
        if key_counts[max_key] > print_threshold:
            print("{:>25} -- {:3} ({:2}%)".format(max_key, key_counts[max_key], total_percentage))
        key_counts.pop(max_key, None)

# Helper function to do the filtering and report as it goes.
def clean_paavo(df, col_index):
    col_name = df.columns[col_index]
    print("Cleaning away postal codes which have less than 30 inhabitants in column\n", col_name)
    count_before = df.shape[0]
    filtered_df = df[df[col_name] >= 30]
    count_after = filtered_df.shape[0]
    print("\tPostal Codes cleaned away:   ", count_before - count_after)
    print("\tContinuing with:              ", count_after, "Postal Codes\n")
    return filtered_df

print("find_uniques_with_counts() defined.")
print("print_uniques_with_counts_dict() defined.")
print("clean_paavo() defined.")

find_uniques_with_counts() defined.
print_uniques_with_counts_dict() defined.
clean_paavo() defined.


Check the distribution of Paavo data column dtypes (of the dataframe). Remember, there are 105 columns in total.

In [14]:
t1 = paavo_df.dtypes.tolist()
d = find_uniques_with_counts(t1)
print_uniques_with_counts_dict(d, 0, 105)

                   object --  91 (86%)
                    int64 --  14 (13%)


Now clean the ".." dots away.

In [15]:
print("\nCleaning Paavo data, start with {} postal codes.\n".format(paavo_df.shape[0]))

paavo_filtered_df = paavo_df

# filter on population structure (age) having less than 30 cases
paavo_filtered_df = clean_paavo(paavo_filtered_df, 4)

# filter on education having less than 30 cases
paavo_filtered_df = clean_paavo(paavo_filtered_df, 28)

# filter on income having less than 30 cases
paavo_filtered_df = clean_paavo(paavo_filtered_df, 35)

# filter on households having less than 30 cases
paavo_filtered_df = clean_paavo(paavo_filtered_df, 42)

# filter on household income having less than 30 cases
paavo_filtered_df = clean_paavo(paavo_filtered_df, 57)

# filter on jobs having less than 30 cases
paavo_filtered_df = clean_paavo(paavo_filtered_df, 72)

# filter on main type of activity having less than 30 cases
paavo_filtered_df = clean_paavo(paavo_filtered_df, 98)



Cleaning Paavo data, start with 3027 postal codes.

Cleaning away postal codes which have less than 30 inhabitants in column
 Asukkaat yhteensä, 2017 (HE)
	Postal Codes cleaned away:    75
	Continuing with:               2952 Postal Codes

Cleaning away postal codes which have less than 30 inhabitants in column
 18 vuotta täyttäneet yhteensä, 2017 (KO)
	Postal Codes cleaned away:    11
	Continuing with:               2941 Postal Codes

Cleaning away postal codes which have less than 30 inhabitants in column
 18 vuotta täyttäneet yhteensä, 2016 (HR)
	Postal Codes cleaned away:    4
	Continuing with:               2937 Postal Codes

Cleaning away postal codes which have less than 30 inhabitants in column
 Taloudet yhteensä, 2017 (TE)
	Postal Codes cleaned away:    107
	Continuing with:               2830 Postal Codes

Cleaning away postal codes which have less than 30 inhabitants in column
 Taloudet yhteensä, 2016 (TR)
	Postal Codes cleaned away:    1
	Continuing with:               282

Then try to convert the rest of columns to some numeric type, preferrence order is int, int64 and float.

In [16]:
print("Ensure number values in columns from 3 to end.\n")
for i in range(3, len(paavo_filtered_df.columns)):
    colname = paavo_filtered_df.columns[i]
    # don't convert floats, as they would become ints and lose the decimal parts.
    if str(paavo_filtered_df.iloc[:,i].dtype) != "float64":
        try:
            paavo_filtered_df.iloc[:,i] = paavo_filtered_df.iloc[:,i].astype("int")
        except:
            try:
                paavo_filtered_df.iloc[:,i] = paavo_filtered_df.iloc[:,i].astype(np.int64)
            except:
                try:
                    paavo_filtered_df.iloc[:,i] = paavo_filtered_df.iloc[:,i].astype("float")
                except:
                    print("*** Failed to convert {}th column ({}) to number".format(i, colname))

print("Column data types after cleaning:\n")
                    
t1 = paavo_filtered_df.dtypes.tolist()
d = find_uniques_with_counts(t1)
print_uniques_with_counts_dict(d, 0, 105)

Ensure number values in columns from 3 to end.

Column data types after cleaning:

                    int32 --  97 (92%)
                    int64 --   4 ( 3%)
                  float64 --   3 ( 2%)
                   object --   1 ( 0%)


Drop Finland country level totals from further analysis and check the data head after cleaning.

In [17]:
paavo_filtered_df = paavo_filtered_df.iloc[1:,:]
paavo_filtered_df.reset_index(inplace=True)
paavo_filtered_df.drop(columns=["index"], inplace=True)
paavo_filtered_df.head()

Unnamed: 0,Postinumeroalue,X-koordinaatti metreinä,Y-koordinaatti metreinä,Postinumeroalueen pinta-ala,"Asukkaat yhteensä, 2017 (HE)","Naiset, 2017 (HE)","Miehet, 2017 (HE)","Asukkaiden keski-ikä, 2017 (HE)","0-2-vuotiaat, 2017 (HE)","3-6-vuotiaat, 2017 (HE)",...,"T Kotitalouksien toiminta työnantajina; kotitalouksien eriyttämätön toiminta tavaroiden ja palveluiden tuottamiseksi omaan käyttöön, 2016 (TP)","U Kansainvälisten organisaatioiden ja toimielinten toiminta, 2016 (TP)","X Toimiala tuntematon, 2016 (TP)","Asukkaat yhteensä, 2016 (PT)","Työlliset, 2016 (PT)","Työttömät, 2016 (PT)","Lapset 0-14 -vuotiaat, 2016 (PT)","Opiskelijat, 2016 (PT)","Eläkeläiset, 2016 (PT)","Muut, 2016 (PT)"
0,00100 Helsinki Keskusta - Etu-Töölö (Helsinki ),385114,6672391,2353278,18284,9613,8671,41,434,521,...,0,17,1,18035,10032,856,1812,1198,3326,811
1,00120 Punavuori (Helsinki ),385614,6671378,414010,7108,3751,3357,40,183,234,...,0,0,0,7055,3872,336,817,428,1242,360
2,00130 Kaartinkaupunki (Helsinki ),386228,6671492,428960,1508,772,736,41,34,48,...,0,12,0,1522,839,61,170,104,258,90
3,00140 Kaivopuisto - Ullanlinna (Helsinki ),386410,6670742,931841,7865,4277,3588,41,224,250,...,0,1,0,7934,4218,325,929,518,1519,425
4,00150 Eira - Hernesaari (Helsinki ),385235,6670549,1367328,9496,5129,4367,40,250,282,...,7,8,0,9527,5433,536,942,564,1523,529


### Step 1.3 - Enhance the data

1. Latitude and Longitude values for coordinates in addition to the metric X and Y values (use pyproj for this)
2. Separate the postal code number (5 digits) into a column of its own
3. Separate the city name (in parenthes) into a column of its own
4. Separate the postal code area name (the rest, between number and city name) into a column of its own.


In [18]:
# Helper functions

# Following helper functions assume Paavo 'postal code' field value has
# Structure: 'NNNN <postal code area name> (<city name>)'
# Example: '00120 Punavuori (Helsinki )'

# Return the postal code number from Paavo 'postal code' field value
def get_pc_number(pc):
    return pc[:5].strip()

# Return the city name from Paavo 'postal code' field value
def get_city_name(pc):
    loc = pc.find('(')
    return pc[loc:].strip('() ')

# Return the postal code area name from Paavo 'postal code' field value
def get_pc_area_name(pc):
    loc = pc.find('(')
    return pc[5:loc].strip()
    
print("get_pc_number() defined.")
print("get_city_name() defined.")
print("get_pc_area_name() defined.")


get_pc_number() defined.
get_city_name() defined.
get_pc_area_name() defined.


In [19]:
# store original columns for easier sorting of new columns to the beginning
orig_cols = paavo_filtered_df.columns.tolist()

# 1. Get the latitude and longitude values, based on X and Y meters.
p = pyproj.Proj(proj='utm',zone=35,ellps='WGS84') # use kwarg
x_list = paavo_filtered_df.iloc[:,1].tolist()
y_list = paavo_filtered_df.iloc[:,2].tolist()
lon_list, lat_list = p(x_list, y_list, inverse=True)

# Add latitude and longitude into data
paavo_filtered_df["PC Longitude"] = lon_list
paavo_filtered_df["PC Latitude"] = lat_list

# 2. Create postal code number into a new column
paavo_filtered_df["PC"] = paavo_filtered_df.iloc[:,0].str.slice(stop=5)

# 3. Create city name into a new column
paavo_filtered_df["City"] = paavo_filtered_df.iloc[:,0].map(get_city_name)

# 4. change old postal code column into one containing only the name
postal_code_column_name = paavo_filtered_df.columns[0]
paavo_filtered_df[postal_code_column_name] = paavo_filtered_df[postal_code_column_name].map(get_pc_area_name)


# Sort the columns so that the new columns are on the left side (easier to see)
new_cols = ["PC", postal_code_column_name, "City", "PC Longitude", "PC Latitude"]
paavo_filtered_df = paavo_filtered_df[new_cols + orig_cols[1:]]
paavo_filtered_df.rename(columns={postal_code_column_name: "PC Name"}, inplace=True)

paavo_filtered_df.head()

Unnamed: 0,PC,PC Name,City,PC Longitude,PC Latitude,X-koordinaatti metreinä,Y-koordinaatti metreinä,Postinumeroalueen pinta-ala,"Asukkaat yhteensä, 2017 (HE)","Naiset, 2017 (HE)",...,"T Kotitalouksien toiminta työnantajina; kotitalouksien eriyttämätön toiminta tavaroiden ja palveluiden tuottamiseksi omaan käyttöön, 2016 (TP)","U Kansainvälisten organisaatioiden ja toimielinten toiminta, 2016 (TP)","X Toimiala tuntematon, 2016 (TP)","Asukkaat yhteensä, 2016 (PT)","Työlliset, 2016 (PT)","Työttömät, 2016 (PT)","Lapset 0-14 -vuotiaat, 2016 (PT)","Opiskelijat, 2016 (PT)","Eläkeläiset, 2016 (PT)","Muut, 2016 (PT)"
0,100,Helsinki Keskusta - Etu-Töölö,Helsinki,24.92929,60.172207,385114,6672391,2353278,18284,9613,...,0,17,1,18035,10032,856,1812,1198,3326,811
1,120,Punavuori,Helsinki,24.938865,60.163257,385614,6671378,414010,7108,3751,...,0,0,0,7055,3872,336,817,428,1242,360
2,130,Kaartinkaupunki,Helsinki,24.949856,60.164452,386228,6671492,428960,1508,772,...,0,12,0,1522,839,61,170,104,258,90
3,140,Kaivopuisto - Ullanlinna,Helsinki,24.953552,60.157772,386410,6670742,931841,7865,4277,...,0,1,0,7934,4218,325,929,518,1519,425
4,150,Eira - Hernesaari,Helsinki,24.932508,60.155712,385235,6670549,1367328,9496,5129,...,7,8,0,9527,5433,536,942,564,1523,529


Finally, create a standardized version of the Paavo data.  This is used later in analysis / clustering.

In [20]:
# Create a standardized version of Paavo data.  Don't include unnecessary columns

paavo_stan_cols = paavo_filtered_df.columns.tolist()
paavo_stan_cols = paavo_stan_cols[6:]
paavo_stan_df = paavo_filtered_df[paavo_stan_cols].astype('float64') # type conversions removes a warning
paavo_stan_np = preprocessing.StandardScaler().fit_transform(paavo_stan_df)
paavo_stan_df = pd.DataFrame(paavo_stan_np)
paavo_stan_df["PC"] = paavo_filtered_df["PC"]
paavo_stan_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,94,95,96,97,98,99,100,101,102,PC
0,-1.100272,-0.452328,4.776863,4.838401,4.69772,-0.568997,3.407284,2.724038,2.445841,1.874344,...,3.253556,5.478789,4.745716,6.251701,2.91584,2.597545,3.40643,3.349305,6.101162,100
1,-1.105567,-0.462058,1.389098,1.4319,1.338992,-0.765921,1.029136,0.815641,0.658872,0.484515,...,-0.031457,-0.149592,1.387119,1.963403,0.734109,0.750987,0.810015,0.759059,2.356837,120
2,-1.104971,-0.461983,-0.308423,-0.299244,-0.317618,-0.568997,-0.382594,-0.42116,-0.476645,-0.497112,...,2.287376,-0.149592,-0.305332,-0.148026,-0.419691,-0.44974,-0.282503,-0.463974,0.115223,130
3,-1.108892,-0.45946,1.618566,1.737567,1.484996,-0.568997,1.417598,0.922032,0.903167,0.785807,...,0.161779,-0.149592,1.655991,2.204272,0.687957,0.958841,1.113492,1.103348,2.896484,140
4,-1.109901,-0.457275,2.112969,2.232678,1.977365,-0.765921,1.66394,1.134815,0.853403,0.397043,...,1.514431,-0.149592,2.143262,3.050097,1.573236,0.982967,1.268602,1.10832,3.759921,150


## Step 2 - FourSquare

### Step 2.1 - FourSquare preparations

Define more FS\_ parameters and a helper function that we can easily get venue data for those locations that we choose.

In [21]:

# FourSquare call parameters
FS_VERSION = '20180604'
FS_LIMIT = 50

venue_data_columns = [
    'PC', 
    'PC Latitude', 
    'PC Longitude',
    'Venue Id',
    'Venue', 
    'Venue Latitude', 
    'Venue Longitude', 
    'Venue Category']

# getNearbyVenues accesses FourSquare venue data, returns a dataframe.
# Helper function from 2nd lab of week 3
def getNearbyVenues(postal_codes, latitudes, longitudes, radius=500, section=False):
    
    venues_list=[]
    for pc, lat, lng in zip(postal_codes, latitudes, longitudes):
        #print(pc)

        # create the API request URL
        url = ""
        if section:
            url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&section={}&limit={}'.format(
                FS_CLIENT_ID,
                FS_CLIENT_SECRET,
                FS_VERSION,
                lat,
                lng,
                radius,
                section,
                FS_LIMIT)
        else:
            url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
                FS_CLIENT_ID,
                FS_CLIENT_SECRET,
                FS_VERSION,
                lat,
                lng,
                radius,
                FS_LIMIT)
            
        # make the GET request
        try:
            results = requests.get(url).json()["response"]['groups'][0]['items']

            # return only relevant information for each nearby venue
            venues_list.append([(
                pc, 
                lat, 
                lng,
                v['venue']['id'],
                v['venue']['name'], 
                v['venue']['location']['lat'], 
                v['venue']['location']['lng'],  
                v['venue']['categories'][0]['name']) for v in results])
        except:
            print("SOMETHING WENT WRONG DOWNLOADING {} FROM FourSquare".format(pc))

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    if nearby_venues.shape[0] == 0:
        nearby_venues = pd.DataFrame(columns=venue_data_columns)
    else:
        nearby_venues.columns = venue_data_columns
    
    return(nearby_venues)

print("getNearbyVenues() defined")


# Update directly to the fs_venue_df
def update_fs_venues(df):
    global fs_venue_df

    work_df = df

    print("Filter results in {} postal codes. Starting...".format(work_df.shape[0]))

    # Loop through each postal code, check if there are already venues for it in fs_venue_df.
    # If yes, skip to next postal code, if not then download data and add it to fs_venue_df.
    for pc in work_df["PC"].tolist():
        #print(pc, type(pc))
        if fs_venue_df[fs_venue_df["PC"] == pc].shape[0] == 0:
            # No previous data, download and add
            print("downloading venues for {} from FourSquare.".format(pc), end="")
            
            pc_row_df = work_df[work_df["PC"] == pc]
            lon = pc_row_df.iloc[0,2]
            lat = pc_row_df.iloc[0,3]
            area = pc_row_df.iloc[0,6]
            r = int(math.sqrt(area))
            new_venues_df = getNearbyVenues([pc], [lat], [lon], r)
            fs_venue_df = pd.concat([fs_venue_df, new_venues_df])
            count = new_venues_df.shape[0]
            print("  Received {} venues.".format(count), end="")
            if count == 50:
                print(" More food, coffee and drinks?", end="")
                # Check for more 'food', 'drinks' and 'coffee' venues with the section info
                new_venues_df = getNearbyVenues([pc], [lat], [lon], r, 'food')
                fs_venue_df = pd.concat([fs_venue_df, new_venues_df])
                new_venues_df = getNearbyVenues([pc], [lat], [lon], r, 'coffee')
                fs_venue_df = pd.concat([fs_venue_df, new_venues_df])
                new_venues_df = getNearbyVenues([pc], [lat], [lon], r, 'drinks')
                fs_venue_df = pd.concat([fs_venue_df, new_venues_df])
            print(" Done.")
                
    print("Done.")


print("update_fs_venues() defined.")

#
# Note, it is worthwhile to save the data every now and then, as it is
# a lot of downloading, takes a lot time, and something can fail.
#

# Save and test
def save_fs_to_disk():
    fs_venue_df.to_csv(FS_DATA_FILENAME, index=False)

    # Check via reading data back in
    test_read_df = pd.read_csv(FS_DATA_FILENAME)
    if fs_venue_df.shape == test_read_df.shape:
        print("File saved successfully with {} rows of data.".format(fs_venue_df.shape[0]))
        # Release this data from memory, it was just for testing.
        test_read_df = None
    else:
        print("Something is wrong, files do not match.")
        print("Data shape:", fs_venue_df.shape)
        print("File shape:", test_read_df.shape)
        test_read_df = None

print("save_fs_to_disk() defined.")

getNearbyVenues() defined
update_fs_venues() defined.
save_fs_to_disk() defined.


### Step 2.2 - Load in any existing data FourSquare data

In [22]:
# Create a new empty dataframe for venue data
# In future, load this from file!

#FS_DATA_FILENAME = "FourSquare_downloaded_venues.csv"
FS_DATA_FILENAME = "FourSquare_downloaded_venues_new.csv"

fs_venue_df = None

if os.path.isfile(FS_DATA_FILENAME):
    # load from file
    print("Reading venues from file")
    fs_venue_df = pd.read_csv(FS_DATA_FILENAME, dtype={"PC": 'str'})
else:
    print("No prior venue data file found, creating empty data set.")
    fs_venue_df = pd.DataFrame(columns=venue_data_columns)

print(fs_venue_df.shape)
fs_venue_df.head()


Reading venues from file
(36690, 8)


Unnamed: 0,PC,PC Latitude,PC Longitude,Venue Id,Venue,Venue Latitude,Venue Longitude,Venue Category
0,100,60.172207,24.92929,4adcdb1ff964a5208b5f21e3,Konditoria Café Briossi,60.16732,24.938287,Bakery
1,100,60.172207,24.92929,4adcdb1ff964a5208d5f21e3,Fazer Café,60.168481,24.947506,Café
2,100,60.172207,24.92929,4adcdb1ff964a520a65f21e3,Café Strindberg,60.16769,24.946243,Café
3,100,60.172207,24.92929,4adcdb20f964a520ce5f21e3,KuuKuu,60.1754,24.92519,Scandinavian Restaurant
4,100,60.172207,24.92929,4adcdb20f964a520cf5f21e3,St. Urho's Pub,60.17397,24.9315,Beer Bar


In [23]:

FS_LOAD_ALL = True
FS_LOAD_SELECTED = False

# Downloading all the data gives us over 30 000 rows of data into the file
if FS_LOAD_ALL and (fs_venue_df.shape[0] < 30000):
    # Run the download and saving to disk

    last_l_limit = 0

    # Loop downloading in small sections of 100 postal codes and save into file
    for u_limit in list(range(0, paavo_filtered_df.shape[0], 100)):
        l_limit = u_limit - 200
        l_limit = max(0, l_limit)
        last_l_limit = l_limit
        print("Working on range:", l_limit, u_limit)
        count_before = fs_venue_df.shape[0]
        update_fs_venues(paavo_filtered_df.iloc[l_limit:u_limit,:])
        count_after = fs_venue_df.shape[0]
        print("Venue data has now {} rows of data.".format(count_after))
        if count_after > count_before:
            save_fs_to_disk()

    # Make sure we got the last ones, too
    print("Working on final range from", l_limit, "to the end.")
    count_before = fs_venue_df.shape[0]
    update_fs_venues(paavo_filtered_df.iloc[last_l_limit:,:])
    count_after = fs_venue_df.shape[0]
    print("Venue data has now {} rows of data.".format(count_after))
    if count_after > count_before:
        save_fs_to_disk()
    
    # Finally, remove any duplicate rows.  There can be some for postal codes with over 50 venues,
    # because in such cases we additionally downloaded only restaurants, coffee shops and drinking places.
    #
    # If there are duplicate rows, they are identical so just take the first in such cases
    fs_venue_df = fs_venue_df.groupby(["PC", "Venue Id"]).first().reset_index()
    fs_venue_df = fs_venue_df[venue_data_columns]
    save_fs_to_disk()

elif FS_LOAD_SELECTED:

    # Change this as you like
    selection = paavo_filtered_df.iloc[0:100,:]

    update_fs_venues(selection)
    fs_venue_df = fs_venue_df.groupby(["PC", "Venue Id"]).first().reset_index()
    fs_venue_df = fs_venue_df[venue_data_columns]
    save_fs_to_disk()

else:
    print("Not downloading, FS_LOAD_ALL and FS_LOAD_SELECTED both False.")

fs_venue_df.head()

Not downloading, FS_LOAD_ALL and FS_LOAD_SELECTED both False.


Unnamed: 0,PC,PC Latitude,PC Longitude,Venue Id,Venue,Venue Latitude,Venue Longitude,Venue Category
0,100,60.172207,24.92929,4adcdb1ff964a5208b5f21e3,Konditoria Café Briossi,60.16732,24.938287,Bakery
1,100,60.172207,24.92929,4adcdb1ff964a5208d5f21e3,Fazer Café,60.168481,24.947506,Café
2,100,60.172207,24.92929,4adcdb1ff964a520a65f21e3,Café Strindberg,60.16769,24.946243,Café
3,100,60.172207,24.92929,4adcdb20f964a520ce5f21e3,KuuKuu,60.1754,24.92519,Scandinavian Restaurant
4,100,60.172207,24.92929,4adcdb20f964a520cf5f21e3,St. Urho's Pub,60.17397,24.9315,Beer Bar


## Step 3 - Analysis of data

### Step 3.1 - Prepare for analysis: onehot

Analyze what we have and also cluster the data

In [24]:
print("So we have {} (rows, columns) of venue data for Finland".format(fs_venue_df.shape))
print('There are {} uniques categories.'.format(len(fs_venue_df['Venue Category'].unique())))


So we have (36690, 8) (rows, columns) of venue data for Finland
There are 461 uniques categories.


One hot -encoding and additionally, the top 10 most common types of venues for each postal code area.

One hot -encoding: create a column for each kind of venue that we have for the neighborhoods (onehot encoding). Then summarize them by the neighborhood.  And finally find the 10 most common of the for each neighborhood.

In [25]:
# Helper.  Whole is a list, from which we want to sort elements of top to the beginning.
# Parameters 'top' and 'whole' are both lists.  Every element of 'top' must be in 'whole'.
def list_order_to_top(top, whole):
    # Check that all elements in top are also in whole
    for t in top:
        if t not in whole:
            raise Exception("joo")
    rest = []
    for w in whole:
        if w not in top:
            rest = rest + [w]
    return top + rest

print("list_order_to_top() defined.")

list_order_to_top() defined.


In [26]:
# one hot encoding
pc_venues_onehot_df = pd.get_dummies(fs_venue_df[['Venue Category']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
pc_venues_onehot_df['PC'] = fs_venue_df['PC']

pc_venues_onehot_df = pc_venues_onehot_df[list_order_to_top(['PC'], pc_venues_onehot_df.columns.tolist())]

print(pc_venues_onehot_df.shape)
pc_venues_onehot_df.head()

(36690, 462)


Unnamed: 0,PC,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,Airport Service,Airport Terminal,...,Whisky Bar,Windmill,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,100,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,100,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,100,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,100,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,100,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now *pc_venues_onehot_df* contains each venue as its own row.

Next get all postal code area venue offerings into one row by grouping them with PC, and keep their sum equal to one, meaning that if there are 2 airports, 2 trainstations and avideo store, both airport and trainstation columns get value 0.4 and video store would get 0.2 

In [27]:
pc_venues_onehot_grouped_df = pc_venues_onehot_df.groupby('PC').mean().reset_index()
print("PC venues grouped shape", pc_venues_onehot_grouped_df.shape)

# Merge the Borough -column into the data so we can use it to filter data later
#toronto_grouped = pd.merge(toronto_df[['Borough', 'Neighborhood']], toronto_grouped_no_borough, on="Neighborhood")
#print("toronto grouped shape (including borough-column)", toronto_grouped.shape)

pc_venues_onehot_grouped_df = pc_venues_onehot_grouped_df.merge(paavo_filtered_df[["PC", "PC Name", "City"]], on="PC")
pc_venues_onehot_grouped_df = pc_venues_onehot_grouped_df[list_order_to_top(["PC", "PC Name", "City"], pc_venues_onehot_grouped_df.columns.tolist())] 
pc_venues_onehot_grouped_df.head()

PC venues grouped shape (2093, 462)


Unnamed: 0,PC,PC Name,City,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,...,Whisky Bar,Windmill,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,100,Helsinki Keskusta - Etu-Töölö,Helsinki,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.037975,0.006329,0.0,0.0,0.0,0.012658,0.0,0.0
1,120,Punavuori,Helsinki,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.046053,0.0,0.0,0.0,0.0,0.006579,0.0,0.0
2,130,Kaartinkaupunki,Helsinki,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.015748,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,140,Kaivopuisto - Ullanlinna,Helsinki,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.007812,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,150,Eira - Hernesaari,Helsinki,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.019355,0.0,0.0,0.0,0.0,0.006452,0.0,0.0


### Step 3.2 - 10 most common venues for each postal code area

In [28]:
#
# Helper function to focus attention on each neighborhoods most common venues
# Expect that unnecessary columns have been removed already
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    
    return row_categories_sorted.index.values[0:num_top_venues]

print("return_most_common_venues() defined.")

return_most_common_venues() defined.


In [29]:
num_top_venues = 10

indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
# top_columns is used later, too
top_columns = []
for ind in np.arange(num_top_venues):
    try:
        top_columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        top_columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
pc_venues_sorted_df = pd.DataFrame(columns=["PC"] + top_columns)
pc_venues_sorted_df['PC'] = pc_venues_onehot_grouped_df['PC']

for ind in np.arange(pc_venues_onehot_grouped_df.shape[0]):
    pc_venues_sorted_df.iloc[ind, 1:] = return_most_common_venues(pc_venues_onehot_grouped_df.iloc[ind, 3:], num_top_venues)

pc_venues_sorted_df.head()

Unnamed: 0,PC,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,7th Most Common Venue,8th Most Common Venue,9th Most Common Venue,10th Most Common Venue
0,100,Café,Coffee Shop,Scandinavian Restaurant,Sushi Restaurant,Wine Bar,Bakery,Bar,Beer Bar,Pub,Cocktail Bar
1,120,Café,Coffee Shop,Cocktail Bar,Wine Bar,Bar,Scandinavian Restaurant,Bakery,Sushi Restaurant,Park,Restaurant
2,130,Coffee Shop,Café,Scandinavian Restaurant,Cocktail Bar,Bar,Hotel,Restaurant,Mediterranean Restaurant,Pizza Place,Beer Garden
3,140,Café,Coffee Shop,Scandinavian Restaurant,Park,Bar,Bakery,Restaurant,Nightclub,Cocktail Bar,Italian Restaurant
4,150,Café,Bar,Coffee Shop,Scandinavian Restaurant,Park,Beer Bar,Restaurant,Wine Bar,Bakery,Italian Restaurant


### Step 3.3 - Filter data

Paavo and downloaded FourSquare data covers whole of Finland.  Filter out the top 20 cities.  Plus Savonlinna :)

In [30]:

# Turn filters on / off with True / False -values. For now only one filter
FILTER_POSTAL_CODE_AREAS = True

# Top 20 cities in finland (by population) + Savonlinna
filter_cities = ["Helsinki", "Espoo", "Tampere", "Vantaa", "Oulu", "Turku", "Jyväskylä", "Lahti", "Kuopio", "Pori", "Kouvola", "Joensuu", "Lappeenranta",
                 "Vaasa", "Hämeenlinna", "Seinäjoki", "Rovaniemi", "Mikkeli", "Kotka", "Salo", "Savonlinna"]

print("\n\n")

filter = None
if FILTER_POSTAL_CODE_AREAS:
    filt = [False for n in pc_venues_onehot_grouped_df['PC']]
    for city in filter_cities:
        print("Filtering for city: '{}'".format(city))
        city_filter = pc_venues_onehot_grouped_df['City'] == city
        filt = [c or f for c, f in zip(filt, city_filter)]
    filter = pd.Series(data = filt)
else:
    # Effectively no filter, but fill it so that it will pass all data through.
    filter = pd.Series(data = [True for n in pc_venues_onehot_grouped_df['PC']])

filter_passed_through = len([x for x in filter if x])

# the all() method is kind of 'and' operation for the whole series value, it returns True only
# if all of the values in the series are True.  Thus it means there is no filtering.
if filter.all():
    print("No data filtering defined.\n")
else:
    print("Filtering in use, proceeding to clustering with", filter_passed_through, "cases out of", pc_venues_onehot_grouped_df.shape[0], "possible cases.\n")





Filtering for city: 'Helsinki'
Filtering for city: 'Espoo'
Filtering for city: 'Tampere'
Filtering for city: 'Vantaa'
Filtering for city: 'Oulu'
Filtering for city: 'Turku'
Filtering for city: 'Jyväskylä'
Filtering for city: 'Lahti'
Filtering for city: 'Kuopio'
Filtering for city: 'Pori'
Filtering for city: 'Kouvola'
Filtering for city: 'Joensuu'
Filtering for city: 'Lappeenranta'
Filtering for city: 'Vaasa'
Filtering for city: 'Hämeenlinna'
Filtering for city: 'Seinäjoki'
Filtering for city: 'Rovaniemi'
Filtering for city: 'Mikkeli'
Filtering for city: 'Kotka'
Filtering for city: 'Salo'
Filtering for city: 'Savonlinna'
Filtering in use, proceeding to clustering with 679 cases out of 2093 possible cases.



### Step 3.4 - Clustering

1. Put Paavo-data and FourSquare data together for Kclustering.
2. Run clustering algorithm a few times to see how big value K can have so that clusters are still big enough (smallest resulting cluster has at least five (5) elements in it).
3. Use the found K to create the clustering that we continue to explore

In [31]:
# 1. Gather data to use in CLUSTERING

paavo_fs_clustering_df = paavo_stan_df.merge(pc_venues_onehot_grouped_df[filter], on="PC")

print("Merging {} and {} columns together. Result has {} columns, it should be one less than the sum.".format(
    paavo_stan_df.shape[1], pc_venues_onehot_grouped_df[filter].shape[1], paavo_fs_clustering_df.shape[1]))
print("Merging {} and {} rows together. Result has {} columns, it should be the smaller of the two.".format(
    paavo_stan_df.shape[0], pc_venues_onehot_grouped_df[filter].shape[0], paavo_fs_clustering_df.shape[0]))

# the postal codes do no good when clustering, so don't include them
bare_clustering_data = paavo_fs_clustering_df.drop(['PC', 'PC Name', 'City'], axis=1, inplace=False)
bare_clustering_data.head()


Merging 104 and 464 columns together. Result has 567 columns, it should be one less than the sum.
Merging 2107 and 679 rows together. Result has 679 columns, it should be the smaller of the two.


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,Whisky Bar,Windmill,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,-1.100272,-0.452328,4.776863,4.838401,4.69772,-0.568997,3.407284,2.724038,2.445841,1.874344,...,0.0,0.0,0.037975,0.006329,0.0,0.0,0.0,0.012658,0.0,0.0
1,-1.105567,-0.462058,1.389098,1.4319,1.338992,-0.765921,1.029136,0.815641,0.658872,0.484515,...,0.0,0.0,0.046053,0.0,0.0,0.0,0.0,0.006579,0.0,0.0
2,-1.104971,-0.461983,-0.308423,-0.299244,-0.317618,-0.568997,-0.382594,-0.42116,-0.476645,-0.497112,...,0.0,0.0,0.015748,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-1.108892,-0.45946,1.618566,1.737567,1.484996,-0.568997,1.417598,0.922032,0.903167,0.785807,...,0.0,0.0,0.007812,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,-1.109901,-0.457275,2.112969,2.232678,1.977365,-0.765921,1.66394,1.134815,0.853403,0.397043,...,0.0,0.0,0.019355,0.0,0.0,0.0,0.0,0.006452,0.0,0.0


In [32]:
# 2. Run the clustering algorithm with different values of K to see what size of clusters emerge

proposed_k = 1

for loop_kclusters in range(2, 15):
    clusters = KMeans(n_clusters=loop_kclusters, random_state=0).fit(bare_clustering_data).labels_
    clusters_counts_d = find_uniques_with_counts(clusters, {}) # returns a dictionary
    clusters_counts_l = list(clusters_counts_d.values())
    clusters_counts_l.sort(reverse=True)
    if clusters_counts_l[-1] >= 5:
        proposed_k = loop_kclusters
    print("cluster count: {:>2}, cluster sizes (sorted by size): {}".format(loop_kclusters, clusters_counts_l))

print("\nProposing k = {} for clusters count (smallest clusters size is still at least 5 elements)".format(proposed_k))


cluster count:  2, cluster sizes (sorted by size): [531, 148]
cluster count:  3, cluster sizes (sorted by size): [455, 185, 39]
cluster count:  4, cluster sizes (sorted by size): [363, 220, 87, 9]
cluster count:  5, cluster sizes (sorted by size): [354, 209, 93, 16, 7]
cluster count:  6, cluster sizes (sorted by size): [346, 199, 76, 35, 16, 7]
cluster count:  7, cluster sizes (sorted by size): [295, 160, 92, 83, 27, 19, 3]
cluster count:  8, cluster sizes (sorted by size): [343, 201, 77, 24, 15, 12, 6, 1]
cluster count:  9, cluster sizes (sorted by size): [275, 127, 92, 80, 56, 27, 15, 6, 1]
cluster count: 10, cluster sizes (sorted by size): [293, 155, 93, 76, 33, 17, 6, 3, 2, 1]
cluster count: 11, cluster sizes (sorted by size): [264, 128, 95, 73, 70, 22, 17, 5, 3, 1, 1]
cluster count: 12, cluster sizes (sorted by size): [273, 131, 95, 82, 41, 18, 13, 10, 9, 5, 1, 1]
cluster count: 13, cluster sizes (sorted by size): [280, 188, 63, 59, 39, 19, 14, 6, 3, 3, 3, 1, 1]
cluster count: 14,

In [33]:
# 3. Set number of clusters to 6, it seems to be best for top20 cities (after this number of very small cluster just increases)
kclusters = proposed_k

# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(bare_clustering_data)

# check cluster labels generated for each row in the dataframe
print("\n\nClustered", len(kmeans.labels_), "postal code areas into", kclusters, "clusters.\n")

# Show some distributions after clustering
k_df = pd.DataFrame(kmeans.labels_)
k_df.columns = ['ClusterLabel']
k_df["count"] = np.ones(len(kmeans.labels_))
kc_df = k_df.groupby("ClusterLabel").count()
kc_df



Clustered 679 postal code areas into 6 clusters.



Unnamed: 0_level_0,count
ClusterLabel,Unnamed: 1_level_1
0,199
1,346
2,7
3,16
4,76
5,35


### Step 3.5 - See clustering results

To check out what the clustering found, inspect the data.  But first put together a combined dataset of Paavo and FourSquare data, and add cluster info back into it.


In [34]:

# We applied filtering for the clustering data, thus we need to apply filtering to the
# resulting data
f_pc_venues_sorted_df = pc_venues_sorted_df[filter].reset_index()

# add clustering labels
f_pc_venues_sorted_df.insert(0, 'Cluster Label', kmeans.labels_)

# merge to add latitude/longitude for each neighborhood
paavo_fs_merged_df = pd.merge(paavo_filtered_df.rename(columns={"PC Number": "PC"}), f_pc_venues_sorted_df, on='PC')
paavo_fs_merged_df = paavo_fs_merged_df.drop('index', 1)
print("\n\nThe merged, filtered and clustered data *paavo_fs_merged_df* contains", paavo_fs_merged_df.shape[0], "postal code areas and ",
      paavo_fs_merged_df.shape[1], " data columns.\n")

cols = paavo_fs_merged_df.columns
new_order_cols = list_order_to_top(["PC", "PC Name", "City", "Cluster Label"] + top_columns, cols)
paavo_fs_merged_df = paavo_fs_merged_df[new_order_cols]
paavo_fs_merged_df.head(10) # check the last columns!



The merged, filtered and clustered data *paavo_fs_merged_df* contains 679 postal code areas and  120  data columns.



Unnamed: 0,PC,PC Name,City,Cluster Label,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,...,"T Kotitalouksien toiminta työnantajina; kotitalouksien eriyttämätön toiminta tavaroiden ja palveluiden tuottamiseksi omaan käyttöön, 2016 (TP)","U Kansainvälisten organisaatioiden ja toimielinten toiminta, 2016 (TP)","X Toimiala tuntematon, 2016 (TP)","Asukkaat yhteensä, 2016 (PT)","Työlliset, 2016 (PT)","Työttömät, 2016 (PT)","Lapset 0-14 -vuotiaat, 2016 (PT)","Opiskelijat, 2016 (PT)","Eläkeläiset, 2016 (PT)","Muut, 2016 (PT)"
0,100,Helsinki Keskusta - Etu-Töölö,Helsinki,2,Café,Coffee Shop,Scandinavian Restaurant,Sushi Restaurant,Wine Bar,Bakery,...,0,17,1,18035,10032,856,1812,1198,3326,811
1,120,Punavuori,Helsinki,5,Café,Coffee Shop,Cocktail Bar,Wine Bar,Bar,Scandinavian Restaurant,...,0,0,0,7055,3872,336,817,428,1242,360
2,130,Kaartinkaupunki,Helsinki,1,Coffee Shop,Café,Scandinavian Restaurant,Cocktail Bar,Bar,Hotel,...,0,12,0,1522,839,61,170,104,258,90
3,140,Kaivopuisto - Ullanlinna,Helsinki,4,Café,Coffee Shop,Scandinavian Restaurant,Park,Bar,Bakery,...,0,1,0,7934,4218,325,929,518,1519,425
4,150,Eira - Hernesaari,Helsinki,5,Café,Bar,Coffee Shop,Scandinavian Restaurant,Park,Beer Bar,...,7,8,0,9527,5433,536,942,564,1523,529
5,160,Katajanokka,Helsinki,0,Café,Bar,Scandinavian Restaurant,Restaurant,Boat or Ferry,Park,...,0,0,0,4448,2019,192,602,290,1148,197
6,170,Kruununhaka,Helsinki,5,Café,Coffee Shop,Scandinavian Restaurant,Bar,Beer Bar,Hotel Bar,...,0,234,0,7427,4099,345,892,507,1297,287
7,180,Kamppi - Ruoholahti,Helsinki,2,Café,Coffee Shop,Bar,Scandinavian Restaurant,Beer Bar,Pub,...,0,0,0,13555,7475,768,1618,859,2151,684
8,190,Suomenlinna,Helsinki,1,Café,History Museum,Park,Boat or Ferry,Scenic Lookout,Island,...,0,9,0,748,371,44,173,42,89,29
9,200,Lauttasaari,Helsinki,3,Bar,Café,Restaurant,Pizza Place,Grocery Store,Scandinavian Restaurant,...,0,0,0,15183,8094,714,2130,916,2810,519


#### Step 3.5.1 - Analysis 1 "Most common venue categories in clusters postal code areas"

In [35]:
import operator

threshold_percentage = 10
print("For each cluster, for each postal code area in that cluster, consider those venue categories that are mentioned in 1st, 2nd or 3rd most common ones.")
print("For each cluster, show categories that make it into the top 3 in most of the postal code areas.  For example, if a 'cafe' makes it into the top 3 for")
print("every postal code area in the cluster, then category 'cafe' should have 100% appearance in the analysis below.  Threshold for showing is", threshold_percentage, "%.")
for c_id in range(kclusters):
    clust = paavo_fs_merged_df[paavo_fs_merged_df["Cluster Label"] == c_id]
    cluster_size = clust.shape[0]
    cluster_threshold = (threshold_percentage / 100) * clust.shape[0]
    print("\nCluster {} has {} rows.".format(c_id, cluster_size))

    cat_counts = {}
    for pos in range(4,7):
        for category in clust.iloc[:,pos].tolist():
            if category in cat_counts.keys():
                cat_counts[category] = cat_counts[category] + 1
            else:
                cat_counts[category] = 1

    for i in (range(len(cat_counts.keys()))):
        key = max(cat_counts.items(), key=operator.itemgetter(1))[0]
        if cat_counts[key] > cluster_threshold:
            print("{:>25} -- {:3} ({:2}%)".format(key, cat_counts[key], int(100*(cat_counts[key]/cluster_size))))
        cat_counts.pop(key, None)

#clust.groupby("1st Most Common Venue").count()
#clust.head()


For each cluster, for each postal code area in that cluster, consider those venue categories that are mentioned in 1st, 2nd or 3rd most common ones.
For each cluster, show categories that make it into the top 3 in most of the postal code areas.  For example, if a 'cafe' makes it into the top 3 for
every postal code area in the cluster, then category 'cafe' should have 100% appearance in the analysis below.  Threshold for showing is 10 %.

Cluster 0 has 199 rows.
            Grocery Store --  81 (40%)
              Supermarket --  62 (31%)
              Pizza Place --  55 (27%)
                     Café --  49 (24%)
                 Bus Stop --  42 (21%)
                      Bar --  30 (15%)
               Restaurant --  20 (10%)

Cluster 1 has 346 rows.
            Grocery Store -- 101 (29%)
              Supermarket --  68 (19%)
                     Café --  54 (15%)
              Pizza Place --  44 (12%)
                    Hotel --  36 (10%)

Cluster 2 has 7 rows.
                 

Observations about typical postal code areas in clusters:
- clusters 0 and 1 are big and their typical top venues are about shopping and grocery stores.  Services like cafes or some sort of restaurant are on the list, but clearly behind the shopping super markets.
- clusters 2 and 3 are quite small and kind of marginal in this examination.  However, they seem to be quite clearly cafe and restaurant services oriented clusters.
- cluster 4 is kind of balanced between food/cafe services and supermarkets / grocery stores.
- cluster 5 is the most clearly cafe and restaurant services oriented cluster of all.



#### Step 3.5.2 - Analysis 2: compare average (over postal code areas) numbers for each cluster

1. Use the previous "one hot" format of each venue category, and sum up venues in each category in each postal code area.
2. Create an additional summary (column) of cafe activity in the postal code areas.
3. Combine together all of Paavo data and this one hot summary of FourSquare venues.  NOTE this creates a dataframe of 584 columns (onehot encoding creates a lot of columns)
4. we are ready to show (and explore) the average number (per postal code area) of each feature **WITHIN** the cluster.  Purpose is to see if clusters differ from each other in some easily noticeable way.


In [36]:
# 1. revisit the onehot data, this time use 'sum' to summarize the data by postal code area

pc_venues_onehot_grouped_s_df = pc_venues_onehot_df.groupby('PC').sum().reset_index()
print("PC venues grouped (count) shape", pc_venues_onehot_grouped_s_df.shape)

pc_venues_onehot_grouped_s_df = pc_venues_onehot_grouped_s_df.merge(paavo_filtered_df[["PC", "PC Name", "City"]], on="PC")
pc_venues_onehot_grouped_s_df = pc_venues_onehot_grouped_s_df[list_order_to_top(["PC", "PC Name", "City"], pc_venues_onehot_grouped_s_df.columns.tolist())] 
pc_venues_onehot_grouped_s_df.head()


PC venues grouped (count) shape (2093, 462)


Unnamed: 0,PC,PC Name,City,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,...,Whisky Bar,Windmill,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
0,100,Helsinki Keskusta - Etu-Töölö,Helsinki,0,0,0,0,0,0,0,...,0,0,6,1,0,0,0,2,0,0
1,120,Punavuori,Helsinki,0,0,0,0,0,0,0,...,0,0,7,0,0,0,0,1,0,0
2,130,Kaartinkaupunki,Helsinki,0,0,0,0,0,0,0,...,0,0,2,0,0,0,0,0,0,0
3,140,Kaivopuisto - Ullanlinna,Helsinki,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,150,Eira - Hernesaari,Helsinki,0,0,0,0,0,0,0,...,0,0,3,0,0,0,0,1,0,0


In [37]:
# 2. create one additional column to summarize 'cafe' activity in the area

# vcf = venue category filter
vcf_columns = []
look_for_these = ["cafe", "coffee"]
for c in pc_venues_onehot_grouped_s_df.columns:
    for l in look_for_these:
        if l in c.lower():
            vcf_columns = vcf_columns + [c]

# Create summary column, initialize with zeros, and then loop the vcf_columns to sum it up
pc_venues_onehot_grouped_s_df["Cafe Summary"] = np.zeros(pc_venues_onehot_grouped_s_df.shape[0])
for c in vcf_columns:
    pc_venues_onehot_grouped_s_df["Cafe Summary"] = pc_venues_onehot_grouped_s_df["Cafe Summary"] + pc_venues_onehot_grouped_s_df[c]

pc_venues_onehot_grouped_s_df["Cafe Summary"] = pc_venues_onehot_grouped_s_df["Cafe Summary"].astype('int')
print("\nCreated 'Cafe Summary' column by summing up following columns:", vcf_columns, "\n")

vcf_columns = ["PC"] + vcf_columns + ["Cafe Summary"]
pc_venues_onehot_grouped_s_df[vcf_columns].head()


Created 'Cafe Summary' column by summing up following columns: ['Cafeteria', 'Coffee Shop', 'College Cafeteria'] 



Unnamed: 0,PC,Cafeteria,Coffee Shop,College Cafeteria,Cafe Summary
0,100,0,17,0,17
1,120,0,17,1,18
2,130,0,16,0,16
3,140,0,12,0,12
4,150,1,13,1,15


In [38]:
# 2.1 create one additional column to summarize 'restaurant' activity in the area

# vcf = venue category filter
vcf_columns = []
look_for_these = ["pizza", 'restaurant', 'blini', 'breakfast', 'buffet', 'burger', 'burrito', 'diner', 'food truck', 'fried chicken', 'noodle house', 'sandwich'
                 'steakhouse', 'taco', 'wings joint']
for c in pc_venues_onehot_grouped_s_df.columns:
    for l in look_for_these:
        if l in c.lower():
            vcf_columns = vcf_columns + [c]

# Create summary column, initialize with zeros, and then loop the vcf_columns to sum it up
pc_venues_onehot_grouped_s_df["Restaurant Summary"] = np.zeros(pc_venues_onehot_grouped_s_df.shape[0])
for c in vcf_columns:
    pc_venues_onehot_grouped_s_df["Restaurant Summary"] = pc_venues_onehot_grouped_s_df["Restaurant Summary"] + pc_venues_onehot_grouped_s_df[c]

pc_venues_onehot_grouped_s_df["Restaurant Summary"] = pc_venues_onehot_grouped_s_df["Restaurant Summary"].astype('int')
print("\nCreated 'Restaurant Summary' column by summing up following columns:", vcf_columns, "\n")

vcf_columns = ["PC", "Restaurant Summary"]
pc_venues_onehot_grouped_s_df[vcf_columns].head()


Created 'Restaurant Summary' column by summing up following columns: ['Afghan Restaurant', 'African Restaurant', 'American Restaurant', 'Asian Restaurant', 'Australian Restaurant', 'Austrian Restaurant', 'Bed & Breakfast', 'Belgian Restaurant', 'Blini House', 'Brazilian Restaurant', 'Breakfast Spot', 'Buffet', 'Burger Joint', 'Burrito Place', 'Cambodian Restaurant', 'Caucasian Restaurant', 'Chinese Restaurant', 'Comfort Food Restaurant', 'Dim Sum Restaurant', 'Diner', 'Doner Restaurant', 'Dumpling Restaurant', 'Eastern European Restaurant', 'Ethiopian Restaurant', 'Falafel Restaurant', 'Fast Food Restaurant', 'Filipino Restaurant', 'Food Truck', 'French Restaurant', 'Fried Chicken Joint', 'German Restaurant', 'Greek Restaurant', 'Halal Restaurant', 'Hawaiian Restaurant', 'Himalayan Restaurant', 'Hungarian Restaurant', 'Indian Restaurant', 'Indonesian Restaurant', 'Italian Restaurant', 'Japanese Restaurant', 'Kebab Restaurant', 'Korean Restaurant', 'Kurdish Restaurant', 'Lebanese Resta

Unnamed: 0,PC,Restaurant Summary
0,100,41
1,120,43
2,130,41
3,140,41
4,150,39


In [39]:
# 3. merge...
paavo_fs_merged_df = pd.merge(paavo_fs_merged_df, pc_venues_onehot_grouped_s_df.drop(['PC Name', 'City'], axis=1, inplace=False), on="PC")
print("Shape of merged data:", paavo_fs_merged_df.shape)
# paavo_fs_merged_df.head()

Shape of merged data: (679, 583)


In [40]:
# 4. Show average numbers (over postal code areas) for each cluster

# Show floats in two decimals
# Note: this has global effect on all DataFrames!
pd.options.display.float_format = '{:,.2f}'.format

clus_df = paavo_fs_merged_df
clus_df = clus_df.groupby("Cluster Label").mean()
clus_df = clus_df.drop(columns=["PC Longitude", "PC Latitude", "X-koordinaatti metreinä", "Y-koordinaatti metreinä"])
# convert value from square meters to square kilometers
clus_df["Postinumeroalueen pinta-ala"] = clus_df["Postinumeroalueen pinta-ala"] / 1000000
#clus_df = clus_df.astype("int")

#
# Change values in following columns into percentiges, which makes relative comparisons easier
#
clus_df.iloc[:, 26] = clus_df.iloc[:,26] / clus_df.iloc[:,25]
clus_df.iloc[:, 27] = clus_df.iloc[:,27] / clus_df.iloc[:,25]
clus_df.iloc[:, 28] = clus_df.iloc[:,28] / clus_df.iloc[:,25]
clus_df.iloc[:, 29] = clus_df.iloc[:,29] / clus_df.iloc[:,25]
clus_df.iloc[:, 30] = clus_df.iloc[:,30] / clus_df.iloc[:,25]
clus_df.iloc[:, 31] = clus_df.iloc[:,31] / clus_df.iloc[:,25]
clus_df.iloc[:, 35] = clus_df.iloc[:,35] / clus_df.iloc[:,32]
clus_df.iloc[:, 36] = clus_df.iloc[:,36] / clus_df.iloc[:,32]
clus_df.iloc[:, 37] = clus_df.iloc[:,37] / clus_df.iloc[:,32]
clus_df.iloc[:, 42] = clus_df.iloc[:,42] / clus_df.iloc[:,39]
clus_df.iloc[:, 43] = clus_df.iloc[:,43] / clus_df.iloc[:,39]
clus_df.iloc[:, 44] = clus_df.iloc[:,44] / clus_df.iloc[:,39]
clus_df.iloc[:, 45] = clus_df.iloc[:,45] / clus_df.iloc[:,39]
clus_df.iloc[:, 46] = clus_df.iloc[:,46] / clus_df.iloc[:,39]
clus_df.iloc[:, 47] = clus_df.iloc[:,47] / clus_df.iloc[:,39]
clus_df.iloc[:, 48] = clus_df.iloc[:,48] / clus_df.iloc[:,39]
clus_df.iloc[:, 49] = clus_df.iloc[:,49] / clus_df.iloc[:,39]
clus_df.iloc[:, 50] = clus_df.iloc[:,50] / clus_df.iloc[:,39]
clus_df.iloc[:, 51] = clus_df.iloc[:,51] / clus_df.iloc[:,39]
clus_df.iloc[:, 52] = clus_df.iloc[:,52] / clus_df.iloc[:,39]
clus_df.iloc[:, 53] = clus_df.iloc[:,53] / clus_df.iloc[:,39]
clus_df.iloc[:, 57] = clus_df.iloc[:,57] / clus_df.iloc[:,54]
clus_df.iloc[:, 58] = clus_df.iloc[:,58] / clus_df.iloc[:,54]
clus_df.iloc[:, 59] = clus_df.iloc[:,59] / clus_df.iloc[:,54]
clus_df.iloc[:, 70] = clus_df.iloc[:,70] / clus_df.iloc[:,69]
clus_df.iloc[:, 71] = clus_df.iloc[:,71] / clus_df.iloc[:,69]
clus_df.iloc[:, 72] = clus_df.iloc[:,72] / clus_df.iloc[:,69]
clus_df.iloc[:, 96] = clus_df.iloc[:,96] / clus_df.iloc[:,95]
clus_df.iloc[:, 97] = clus_df.iloc[:,97] / clus_df.iloc[:,95]
clus_df.iloc[:, 98] = clus_df.iloc[:,98] / clus_df.iloc[:,95]
clus_df.iloc[:, 99] = clus_df.iloc[:,99] / clus_df.iloc[:,95]
clus_df.iloc[:, 100] = clus_df.iloc[:,100] / clus_df.iloc[:,95]
clus_df.iloc[:, 101] = clus_df.iloc[:,101] / clus_df.iloc[:,95]


Below is data to explore.  Because of many aspects in Paavo data, this exploration is shown in many tables.

In [41]:
print("For each cluster, average data of its postal code areas. Columns 6-10 are percentiles of the amount in column 5.")
clus_df.iloc[:, [0,1,2,3,25,26,28,29,30,31]]


For each cluster, average data of its postal code areas. Columns 6-10 are percentiles of the amount in column 5.


Unnamed: 0_level_0,Postinumeroalueen pinta-ala,"Asukkaat yhteensä, 2017 (HE)","Naiset, 2017 (HE)","Miehet, 2017 (HE)","18 vuotta täyttäneet yhteensä, 2017 (KO)","Perusasteen suorittaneet, 2017 (KO)","Ylioppilastutkinnon suorittaneet, 2017 (KO)","Ammatillisen tutkinnon suorittaneet, 2017 (KO)","Alemman korkeakoulututkinnon suorittaneet, 2017 (KO)","Ylemmän korkeakoulututkinnon suorittaneet, 2017 (KO)"
Cluster Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,30.66,4815.0,2469.33,2345.67,3892.64,0.23,0.08,0.43,0.13,0.13
1,69.04,1172.63,583.97,588.66,926.0,0.23,0.06,0.49,0.12,0.11
2,4.72,18099.43,9604.86,8494.57,16482.0,0.16,0.17,0.29,0.17,0.2
3,8.2,18948.06,9812.94,9135.12,15599.75,0.25,0.11,0.38,0.13,0.14
4,15.13,9850.05,5104.33,4745.72,7789.43,0.23,0.09,0.38,0.14,0.16
5,10.85,8579.4,4588.6,3990.8,7609.71,0.2,0.13,0.36,0.15,0.16


In [42]:
print("For each cluster, average data of its postal code areas. Columns 4-6 are percentiles of the amount in column 1.")
clus_df.iloc[:, 32:39]


For each cluster, average data of its postal code areas. Columns 4-6 are percentiles of the amount in column 1.


Unnamed: 0_level_0,"18 vuotta täyttäneet yhteensä, 2016 (HR)","Asukkaiden keskitulot, 2016 (HR)","Asukkaiden mediaanitulot, 2016 (HR)","Alimpaan tuloluokkaan kuuluvat asukkaat, 2016 (HR)","Keskimmäiseen tuloluokkaan kuuluvat asukkaat, 2016 (HR)","Ylimpään tuloluokkaan kuuluvat asukkaat, 2016 (HR)","Asukkaiden ostovoimakertymä, 2016 (HR)"
Cluster Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,3858.22,24219.65,21562.18,0.2,0.59,0.21,93156310.95
1,924.05,24321.08,21474.72,0.19,0.59,0.22,23414483.59
2,16345.0,26244.14,21244.71,0.22,0.55,0.23,425237865.0
3,15368.75,23810.06,21190.5,0.21,0.59,0.2,361045304.69
4,7720.92,25813.08,22871.83,0.19,0.57,0.24,196468103.32
5,7525.94,24035.54,20430.29,0.22,0.59,0.2,180457429.54


In [43]:
print("For each cluster, average data of its postal code areas. Columns 4-15 are percentiles of the amount in column 1.")
clus_df.iloc[:, 39:54]

For each cluster, average data of its postal code areas. Columns 4-15 are percentiles of the amount in column 1.


Unnamed: 0_level_0,"Taloudet yhteensä, 2017 (TE)","Talouksien keskikoko, 2017 (TE)","Asumisväljyys, 2017 (TE)","Nuorten yksinasuvien taloudet, 2017 (TE)","Lapsettomat nuorten parien taloudet, 2017 (TE)","Lapsitaloudet, 2017 (TE)","Pienten lasten taloudet, 2017 (TE)","Alle kouluikäisten lasten taloudet, 2017 (TE)","Kouluikäisten lasten taloudet, 2017 (TE)","Teini-ikäisten lasten taloudet, 2017 (TE)","Aikuisten taloudet, 2017 (TE)","Eläkeläisten taloudet, 2017 (TE)","Omistusasunnoissa asuvat taloudet, 2017 (TE)","Vuokra-asunnoissa asuvat taloudet, 2017 (TE)","Muissa asunnoissa asuvat taloudet, 2017 (TE)"
Cluster Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
0,2424.19,2.02,38.81,0.13,0.05,0.21,0.05,0.11,0.1,0.08,0.51,0.28,0.57,0.41,0.02
1,538.14,2.19,43.94,0.07,0.03,0.24,0.06,0.11,0.12,0.1,0.44,0.32,0.74,0.24,0.02
2,11254.71,1.6,36.03,0.25,0.1,0.09,0.03,0.05,0.03,0.03,0.67,0.24,0.41,0.57,0.02
3,9969.56,1.89,35.67,0.15,0.07,0.19,0.06,0.1,0.08,0.07,0.55,0.26,0.46,0.53,0.02
4,4815.55,2.07,36.15,0.12,0.05,0.24,0.06,0.12,0.11,0.08,0.52,0.25,0.53,0.45,0.02
5,5239.4,1.64,36.29,0.22,0.08,0.11,0.03,0.06,0.04,0.04,0.62,0.27,0.43,0.55,0.02


In [44]:
print("For each cluster, average data of its postal code areas.")
clus_df.iloc[:, 54:57]

For each cluster, average data of its postal code areas.


Unnamed: 0_level_0,"Taloudet yhteensä, 2016 (TR)","Talouksien keskitulot, 2016 (TR)","Talouksien mediaanitulot, 2016 (TR)"
Cluster Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,2385.61,40075.79,33659.33
1,535.05,43001.26,36945.64
2,11075.71,38984.86,28938.0
3,9767.12,37409.5,30195.62
4,4749.12,42548.13,35086.05
5,5137.94,35647.46,27668.29


In [45]:
print("For each cluster, average data of its postal code areas.")
clus_df.iloc[:, 61:69]

For each cluster, average data of its postal code areas.


Unnamed: 0_level_0,"Kesämökit yhteensä, 2017 (RA)","Rakennukset yhteensä, 2017 (RA)","Muut rakennukset yhteensä, 2017 (RA)","Asuinrakennukset yhteensä, 2017 (RA)","Asunnot, 2017 (RA)","Asuntojen keskipinta-ala, 2017 (RA)","Pientaloasunnot, 2017 (RA)","Kerrostaloasunnot, 2017 (RA)"
Cluster Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,68.37,904.85,113.1,791.75,2634.97,77.47,1090.05,1544.93
1,165.35,423.55,63.72,359.83,597.16,94.41,427.79,169.37
2,25.0,825.29,300.57,524.71,12920.86,56.86,500.29,12420.57
3,24.19,1614.69,188.81,1425.88,10617.5,66.79,2154.94,8462.56
4,61.2,1272.91,108.39,1164.51,5126.95,74.43,1705.18,3421.76
5,49.37,752.83,221.03,531.8,5831.54,58.92,587.71,5243.83


In [46]:
print("For each cluster, average data of its postal code areas. Columns 2-4 are percentiles of the amount in column 1.")
clus_df.iloc[:, 69:73]

For each cluster, average data of its postal code areas. Columns 2-4 are percentiles of the amount in column 1.


Unnamed: 0_level_0,"Työpaikat yhteensä, 2016 (TP)","Alkutuotannon työpaikat, 2016 (TP)","Jalostuksen työpaikat, 2016 (TP)","Palveluiden työpaikat, 2016 (TP)"
Cluster Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,1603.55,0.01,0.26,0.73
1,623.27,0.02,0.22,0.76
2,24132.86,0.0,0.09,0.91
3,4512.0,0.0,0.17,0.82
4,2081.25,0.0,0.16,0.83
5,9487.89,0.0,0.13,0.86


In [47]:
print("For each cluster, average data of its postal code areas. Columns 2-7 are percentiles of the amount in column 1.")
clus_df.iloc[:, 95:102]

For each cluster, average data of its postal code areas. Columns 2-7 are percentiles of the amount in column 1.


Unnamed: 0_level_0,"Asukkaat yhteensä, 2016 (PT)","Työlliset, 2016 (PT)","Työttömät, 2016 (PT)","Lapset 0-14 -vuotiaat, 2016 (PT)","Opiskelijat, 2016 (PT)","Eläkeläiset, 2016 (PT)","Muut, 2016 (PT)"
Cluster Label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,4780.61,0.42,0.07,0.16,0.08,0.23,0.03
1,1172.67,0.41,0.06,0.18,0.07,0.25,0.03
2,17920.43,0.51,0.07,0.08,0.1,0.22,0.03
3,18680.94,0.42,0.08,0.15,0.09,0.22,0.04
4,9766.41,0.44,0.07,0.18,0.08,0.19,0.03
5,8489.69,0.45,0.07,0.1,0.09,0.25,0.03


Observations:
- cluster 1 postal code areas are on average by far the largest, while their population is also the smallest.  Cluster 0 follows cluster 1 on this, but not so heavily.  Yet both are remarkably different from other clusters in these regards.  In the field of education balance, these clusters are clearly less academic and more hands on, education that prepares directly for some profession like cook, mechanic, constructions etc.
- Cluster 2 has on average the smallest postal code areas and most dense by population, and is on average the most educated, clearly geared towards academic education.  It has proportionally the least kids (age under 15) on average, but then again most studends by a slight margin. Cluster 2 postal code areas have also the most jobs on average, by a clear margin, and they are mostly service area professions.  Economically cluster 2 areas have highest average income.
- Cluster 3 follows cluster 2, but now in such extremes.  Postal code areas are still rather small, and these have on average the most population, but not so densily populated (almost twice the area / population when compared to cluster 2).  Proportion of kids in these postal code areas is average in comparison to other clusters.  These postal codes are average on the job amounts they offer.  Cluster 3 areas have lowest average income.
- Clusters 4 and 5 are between above clusters, but closer to clusters 2 and 3 than 0 and 1.  Cluster 5 postal code areas are on average a bit smaller than cluster 4 counterparts, and cluster 5 has a small edge in academic education (mainly in high school level, not university level).  What divides clusters 4 and 5 is that cluster 5 postal code areas offer twice more jobs than cluster 4 counterparts, but cluster 4 inhabitants have on average better income than inhabitants in cluster 5.

It would appear that cluster 2 is areas in big city centers, and cluster 3 is perhaps the neighborhood around these center / downtown areas.  And similarly cluster 5 would be average city central areas, while cluster 4 are the surrounding neighborhoods.  Clusters 0 and 1 are rather on the rural side of the scale.  Perhaps also in the smaller cities the well earning people like to live a bit outside the center, where as in the biggest cities the best earning people live in the very centres.

#### Step 3.5.3 - Analysis 3: Visual inspection of clusters on map

Each dot on the map marks a postal code area that was involved in analysis. Dot is located roughly in the center of the postal code area.

Cluster colors:
- cluster 0: blue
- cluster 1: brown
- cluster 2: black
- cluster 3: grey
- cluster 4: pink
- cluster 5: red

Zoom into the map to see in more detail!

In [48]:

my_color = "#050505"
#finland_latitude = 62.777754
finland_latitude = 65.0
finland_longitude = 26.199539
map_test = folium.Map(location=[finland_latitude, finland_longitude], zoom_start=5)

cluster_colors = ["#0505ff", "#cc8305", "#050505", "#707070", "#ff7575", "#ff0000"]
for i in range(paavo_fs_merged_df.shape[0]):
#for i in range(5):
    row = paavo_fs_merged_df.iloc[i,:]
#    print(row[0], row[1], row[2], row[14], row[15])
    
    # show postal code center location
    label = folium.Popup(row[0] + " " + row[1] + ", " + row[2] + ", cluster " + str(row[3]))
    my_color = cluster_colors[row[3]]
    folium.CircleMarker(
        [row[15], row[14]],
        radius=2,
        popup=label,
        color=my_color,
        fill_color=my_color,
        fill_opacity=0.8).add_to(map_test)

map_test

Observations:

Cluster 2 does indeed seem to be in the very center of biggest cities.  Then again cluster 5 is not just in smaller cities, but there are plenty cluster 5 areas in big cities as well.

Another observation is that these classifications and above attempt to understand them based on averages cannot be blindly trusted.  For example, one of Espoo's most wealthy inhabitan areas 'Westend' is classified to cluster 1.  It is likely to hold true for it that there are not so many services in the area, as it is mainly a neighborhood of big houses and estates, but clustering it together with actually dying rural areas is a sign that we need to watch out what we can deduce from average numbers.



#### Step 3.5.4 - Analysis 4: Compare citywise cluster percentages etc.

Check out how clusters are presented in each city.  That is, how the postal code areas in each city map to different clusters.

In [49]:
paavo_fs_merged_df.iloc[0:5, 18:20]

Unnamed: 0,Postinumeroalueen pinta-ala,"Asukkaat yhteensä, 2017 (HE)"
0,2353278,18284
1,414010,7108
2,428960,1508
3,931841,7865
4,1367328,9496


In [63]:
# use onehot encoding to be able to get cluster values into separate columns 
city_clusters_onehot_df = pd.get_dummies(paavo_fs_merged_df['Cluster Label'])
city_clusters_onehot_df['City'] = paavo_fs_merged_df['City']
city_clusters_onehot_df['Area (km2)'] = paavo_fs_merged_df.iloc[:, 18] / 1000000
city_clusters_onehot_df['Area (km2)'] = city_clusters_onehot_df['Area (km2)'].astype('int')
city_clusters_onehot_df['Population'] = paavo_fs_merged_df.iloc[:, 19].astype('int')
city_clusters_onehot_df['Cafe Summary'] = paavo_fs_merged_df['Cafe Summary'].astype('int')
city_clusters_onehot_df['Restaurant Summary'] = paavo_fs_merged_df['Restaurant Summary'].astype('int')

city_clusters_onehot_df['PC Area count'] = np.ones(city_clusters_onehot_df.shape[0]).astype('int')

city_clusters_onehot_df = city_clusters_onehot_df.groupby('City').sum()

# change cluster counts into percent values
for i in range(0, 6):
    None
    city_clusters_onehot_df[i] = 100 * (city_clusters_onehot_df[i] / city_clusters_onehot_df['PC Area count'])

city_clusters_onehot_df.sort_values(by=['Population'], ascending=False, inplace=True)
city_clusters_onehot_df["Biggest in Finland (population)"] = np.array(list(range(1, city_clusters_onehot_df.shape[0] + 1)))
city_clusters_onehot_df['Population by cafe'] = city_clusters_onehot_df['Population'] / city_clusters_onehot_df['Cafe Summary']
city_clusters_onehot_df['Population by cafe'] = city_clusters_onehot_df['Population by cafe'].astype('int')
city_clusters_onehot_df['Population by restaurant'] = city_clusters_onehot_df['Population'] / city_clusters_onehot_df['Restaurant Summary']
city_clusters_onehot_df['Population by restaurant'] = city_clusters_onehot_df['Population by restaurant'].astype('int')

city_clusters_onehot_df

Unnamed: 0_level_0,0,1,2,3,4,5,Area (km2),Population,Cafe Summary,Restaurant Summary,PC Area count,Biggest in Finland (population),Population by cafe,Population by restaurant
City,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Helsinki,32.53,16.87,3.61,3.61,31.33,12.05,172,628760,328,1351,83,1,1916,465
Espoo,37.78,28.89,0.0,4.44,22.22,6.67,331,273029,137,457,45,2,1992,597
Tampere,29.41,32.35,2.94,8.82,14.71,11.76,735,223117,43,397,34,3,5188,562
Vantaa,28.57,37.14,0.0,8.57,22.86,2.86,226,215273,71,238,35,4,3032,904
Oulu,35.9,35.9,0.0,0.0,23.08,5.13,2592,199340,18,288,39,5,11074,692
Turku,38.46,23.08,3.85,11.54,15.38,7.69,279,188966,40,356,26,6,4724,530
Jyväskylä,43.75,40.62,3.12,3.12,6.25,3.12,1290,139297,37,196,32,7,3764,710
Kuopio,24.0,72.0,0.0,0.0,2.0,2.0,3707,115608,15,161,50,8,7707,718
Lahti,51.72,37.93,0.0,3.45,3.45,3.45,452,112303,29,187,29,9,3872,600
Pori,40.0,56.67,0.0,0.0,0.0,3.33,1147,83485,4,94,30,10,20871,888


One observation here is that only eight (8) of the cities have at all cluster 2 and 3 areas.  And without the exception of Vaasa cities have all kinds of cluster areas present, at least for a small degree.  These cities are mainly the upper end of the top 20 cities in Finland (measured by population)


In [64]:
city_clusters_onehot_df[(city_clusters_onehot_df[2] > 0) | (city_clusters_onehot_df[3] > 0)]

Unnamed: 0_level_0,0,1,2,3,4,5,Area (km2),Population,Cafe Summary,Restaurant Summary,PC Area count,Biggest in Finland (population),Population by cafe,Population by restaurant
City,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Helsinki,32.53,16.87,3.61,3.61,31.33,12.05,172,628760,328,1351,83,1,1916,465
Espoo,37.78,28.89,0.0,4.44,22.22,6.67,331,273029,137,457,45,2,1992,597
Tampere,29.41,32.35,2.94,8.82,14.71,11.76,735,223117,43,397,34,3,5188,562
Vantaa,28.57,37.14,0.0,8.57,22.86,2.86,226,215273,71,238,35,4,3032,904
Turku,38.46,23.08,3.85,11.54,15.38,7.69,279,188966,40,356,26,6,4724,530
Jyväskylä,43.75,40.62,3.12,3.12,6.25,3.12,1290,139297,37,196,32,7,3764,710
Lahti,51.72,37.93,0.0,3.45,3.45,3.45,452,112303,29,187,29,9,3872,600
Vaasa,53.85,30.77,7.69,0.0,7.69,0.0,331,65928,11,97,13,14,5993,679


Of the other cities, not all have cluster 4 areas.  These cities do have:

In [65]:
city_clusters_onehot_df[(city_clusters_onehot_df[2] == 0) & (city_clusters_onehot_df[3] == 0) & (city_clusters_onehot_df[4] > 0)]

Unnamed: 0_level_0,0,1,2,3,4,5,Area (km2),Population,Cafe Summary,Restaurant Summary,PC Area count,Biggest in Finland (population),Population by cafe,Population by restaurant
City,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Oulu,35.9,35.9,0.0,0.0,23.08,5.13,2592,199340,18,288,39,5,11074,692
Kuopio,24.0,72.0,0.0,0.0,2.0,2.0,3707,115608,15,161,50,8,7707,718
Kouvola,21.62,72.97,0.0,0.0,2.7,2.7,2402,79235,15,117,37,11,5282,677
Lappeenranta,17.24,72.41,0.0,0.0,6.9,3.45,1397,71942,15,206,29,12,4796,349
Joensuu,33.33,52.38,0.0,0.0,9.52,4.76,1665,71126,14,112,21,13,5080,635
Hämeenlinna,19.05,66.67,0.0,0.0,9.52,4.76,1860,65860,11,105,21,15,5987,627
Seinäjoki,13.64,72.73,0.0,0.0,9.09,4.55,1349,61880,10,197,22,16,6188,314


Cities left out of the two above 'groups' are following.  They are quite heavily cluster 1 oriented cities.  Also, they are mainly the lower end of the top 20 cities in Finland (measured by population)

In [66]:
city_clusters_onehot_df[(city_clusters_onehot_df[2] == 0) & (city_clusters_onehot_df[3] == 0) & (city_clusters_onehot_df[4] == 0)]

Unnamed: 0_level_0,0,1,2,3,4,5,Area (km2),Population,Cafe Summary,Restaurant Summary,PC Area count,Biggest in Finland (population),Population by cafe,Population by restaurant
City,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1
Pori,40.0,56.67,0.0,0.0,0.0,3.33,1147,83485,4,94,30,10,20871,888
Rovaniemi,30.43,65.22,0.0,0.0,0.0,4.35,4108,59330,8,111,23,17,7416,534
Mikkeli,12.12,84.85,0.0,0.0,0.0,3.03,2719,52750,6,70,33,18,8791,753
Kotka,19.05,76.19,0.0,0.0,0.0,4.76,201,52115,4,44,21,19,13028,1184
Salo,19.35,77.42,0.0,0.0,0.0,3.23,1821,50788,8,109,31,20,6348,465
Savonlinna,12.0,88.0,0.0,0.0,0.0,0.0,2561,32531,3,120,25,21,10843,271


In [55]:

city_venues_df = pc_venues_onehot_df.groupby('PC').sum()
city_venues_df.head()
#pc_venues_onehot_df.head()


Unnamed: 0_level_0,ATM,Accessories Store,Adult Boutique,Afghan Restaurant,African Restaurant,Airport,Airport Lounge,Airport Service,Airport Terminal,American Restaurant,...,Whisky Bar,Windmill,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit
PC,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
100,0,0,0,0,0,0,0,0,0,0,...,0,0,6,1,0,0,0,2,0,0
120,0,0,0,0,0,0,0,0,0,1,...,0,0,7,0,0,0,0,1,0,0
130,0,0,0,0,0,0,0,0,0,1,...,0,0,2,0,0,0,0,0,0,0
140,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
150,0,0,0,0,0,0,0,0,0,1,...,0,0,3,0,0,0,0,1,0,0


In [56]:
paavo_fs_merged_df.head()

Unnamed: 0,PC,PC Name,City,Cluster Label,1st Most Common Venue,2nd Most Common Venue,3rd Most Common Venue,4th Most Common Venue,5th Most Common Venue,6th Most Common Venue,...,Wine Bar,Wine Shop,Winery,Wings Joint,Women's Store,Yoga Studio,Zoo,Zoo Exhibit,Cafe Summary,Restaurant Summary
0,100,Helsinki Keskusta - Etu-Töölö,Helsinki,2,Café,Coffee Shop,Scandinavian Restaurant,Sushi Restaurant,Wine Bar,Bakery,...,6,1,0,0,0,2,0,0,17,41
1,120,Punavuori,Helsinki,5,Café,Coffee Shop,Cocktail Bar,Wine Bar,Bar,Scandinavian Restaurant,...,7,0,0,0,0,1,0,0,18,43
2,130,Kaartinkaupunki,Helsinki,1,Coffee Shop,Café,Scandinavian Restaurant,Cocktail Bar,Bar,Hotel,...,2,0,0,0,0,0,0,0,16,41
3,140,Kaivopuisto - Ullanlinna,Helsinki,4,Café,Coffee Shop,Scandinavian Restaurant,Park,Bar,Bakery,...,1,0,0,0,0,0,0,0,12,41
4,150,Eira - Hernesaari,Helsinki,5,Café,Bar,Coffee Shop,Scandinavian Restaurant,Park,Beer Bar,...,3,0,0,0,0,1,0,0,15,39


#### Step 3.5.5 - Analysis 5: find correlations between coffees and other data

FInd out if there is a linear correlation between coffees and some other data features.

In [None]:
# run linear regression correlation between 'coffee sum' and other columns from column 17 onwards
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

print("Test linear regression accuracies between 'Cafe Summary' and other columns")
#y = np.array(paavo_fs_merged_df.iloc[:, 123]).reshape(-1, 1)
y = np.array(paavo_fs_merged_df["Cafe Summary"]).reshape(-1, 1)
test_scores = []
test_scores_i = 0        # helper index for debugging
for i in range(18, paavo_fs_merged_df.shape[1]):
    colname = paavo_fs_merged_df.columns.tolist()[i]
    if colname != "Cafe Summary":
        X = np.array(paavo_fs_merged_df.iloc[:, i]).reshape(-1, 1)

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

        reg = LinearRegression()
        reg = reg.fit(X_train, y_train)
        yhat = reg.predict(X_test)

        test_scores.append([test_scores_i, i, colname, reg.score(X_train, y_train), reg.score(X_test, y_test)])
        test_scores_i += 1
    
print("\nTop 20 best correlating columns: (descending order in training accuracy score)\n")
top_20 = []
for j in range(20):
    max_key = max(test_scores, key=operator.itemgetter(3))[0]
    top_20.append(test_scores[max_key])
    test_scores[max_key] = [0, 0, 0, 0, 0]
    #print(max_key, test_scores.pop(max_key), len(test_scores))

for k in range(len(top_20)):
    e = top_20[k]
    print("{1:2}. column index: {0[1]:3}     train accuracy: {0[3]:.2f}     test accuracy: {0[4]:5.2f}     column name: {0[2]}".format(e, k+1))
#    print(top_20[k])

In [None]:
colname = paavo_fs_merged_df.columns[356]
print("Scatterplot between {} and {}".format(colname, "Cafe Summary"))

paavo_fs_merged_df.plot(kind='scatter', x=colname, y='Cafe Summary', figsize=(10,6), color="darkblue")
plt.show()

In [None]:
colname = paavo_fs_merged_df.columns[49]
print("Scatterplot between {} and {}".format(colname, "Cafe Summary"))

paavo_fs_merged_df.plot(kind='scatter', x=colname, y='Cafe Summary', figsize=(10,6), color="darkblue")
plt.show()

In [None]:
import math

my_paikka = "99980"
map_test = None

filter = paavo_df["Postinumeroalue"].str.contains(my_paikka)
filtered_df = paavo_df[filter]
if filtered_df.shape[0] > 0:
    
    # look for my_paikka in paavo
    paikka_name = filtered_df.iloc[0,0]
    paikka_x = filtered_df.iloc[0,1]
    paikka_y = filtered_df.iloc[0,2]
    paikka_pinta_ala = filtered_df.iloc[0,3]
    paikka_radius = math.sqrt(int(paikka_pinta_ala))
    print("found: {}\nproceeding with coordinates x={}, y={}, area radius is {}m.".format(paikka_name, paikka_x, paikka_y, paikka_radius))

    # convert x,y coordinates to lon,lat
    p = pyproj.Proj(proj='utm',zone=35,ellps='WGS84') # use kwarg
    paikka_lon, paikka_lat = p(paikka_x, paikka_y, inverse=True)
    print("Converted coordinates: lat={}, lon={}".format(paikka_lat, paikka_lon))

    # show map
    my_color = "#050505"
    map_test = folium.Map(location=[paikka_lat, paikka_lon], zoom_start=12)
    
    # show postal code center location
    label = folium.Popup(paikka_name)
    folium.CircleMarker(
        [paikka_lat, paikka_lon],
        radius=35,
        popup=label,
        color=my_color,
        fill_color=my_color,
        fill_opacity=0.8).add_to(map_test)

    # show venues from FourSquare
    my_color = "#ff0525"
    for i in range(my_place_venues.shape[0]):
        v_name = my_place_venues.iloc[i, 3]
        v_lat = my_place_venues.iloc[i, 4]
        v_lon = my_place_venues.iloc[i, 5]
        v_cat = my_place_venues.iloc[i, 6]
        print(v_name, ", ", v_cat)
        label = folium.Popup(v_name + ", " + v_cat)
        folium.CircleMarker(
            [v_lat, v_lon],
            radius=80,
            popup=label,
            color=my_color,
            fill_color=my_color,
            fill_opacity=0.8).add_to(map_test)

    print("Ok, showing map.")

else:
    print("no match for {}".format(my_paikka))

map_test

In [None]:
import math
math.sqrt(4)

# Folium

In [None]:
# Comment / uncomment next line as needed.
#!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library


In [None]:
# create map

show_me = "utsjoki"

# Kaivari from above
kaivari_lat = 60.15777
kaivari_lon = 24.95355

utsjoki_lat = 69.90161
utsjoki_lon = 27.06928

kuopio_lat = 62.855642
kuopio_lon = 27.768541


lat = kaivari_lat
lon = kaivari_lon

if show_me == "utsjoki":
    lat = utsjoki_lat
    lon = utsjoki_lon
elif show_me == "kuopio":
    lat = kuopio_lat
    lon = kuopio_lon
    

map_test = folium.Map(location=[lat, lon], zoom_start=12)

label = folium.Popup("Kaivari")

folium.CircleMarker(
    [lat, lon],
    radius=35,
    popup=label,
    fill_opacity=0.8).add_to(map_test)


#    color=rainbow[cluster-1],
#    fill_color=rainbow[cluster-1],


map_test


In [None]:
# sample code to calculate values from A to X and sum them up
"{:,}".format(paavo_df.T.iloc[76:98,0].astype(int).sum())
