# Capstone Project Final Report

This notebook is the final project report for Coursera's _Applied Data Science Capstone_ course.

# 1. Introduction

---

This report studies Finland, a country in northern Europe, and more specifically its 20 biggest cities, measured by population.  The overall question to answer is "_How do Finnish cities appear to someone who wants to start a new cafe in Finland?_"  The business problem is divided into two parts: first an overall understanding of Finnish cities, and then more detailed insight into cafe opportunities within these cities.

The audience for this study is **anyone with an entrepreneurial spirit and the intention to start a cafe business**.  This report is especially valuable to immigrants in Finland with such intentions, as they do not necessarily have a very good understanding of Finland to begin with.  However, it also serves native Finnish people with the same intentions: people usually have a good understanding only of nearby areas, and in some cases more distant locations might provide more or better opportunities.

The first question, about the overall understanding of Finnish cities, can be phrased as: "**What are the characteristics that differentiate the cities, and how does each city compare to the others in relation to these characteristics?**"  For example, is population size the only differentiating factor between these cities, or are there others, perhaps even more substantial ones?  Are there characteristics that tell us something about the environment or the culture within these cities?  These characteristics can be used to identify potential needs in different areas, and to help in profiling the new cafe service and finding its best target audience.  For example, a cafe for university students and a cafe for construction workers may both benefit from understanding the needs of these different client groups.

The second, more detailed question is "**Which other factors correlate with the offering of these cafe services?**"  For example, one assumption is that population should somehow correlate with the number of cafes in the neighborhood.  The question then is which factors are good indicators: do education, income or the presence of some other services nearby affect the demand for and offering of cafes?  Once identified, these factors can be used to search for good business locations.  For example, if the factors suggest that some area should have X cafes but it has only Y, that area is perhaps a good business opportunity.


# 2. Data

---

This project uses two datasets: venue data provided by _Foursquare_, and a postal code area statistics dataset called Paavo, which is provided as open data by _Statistics Finland_.


## 2.1 Foursquare data

---

Foursquare provides venue data near a given location.  A venue can be anything a Foursquare user has entered into the service, like a cafe, restaurant, museum, park or bus stop.  Foursquare provides a lot of data about venues, such as reviews, location and type (category).  Foursquare also provides information on the users who report and evaluate venues, mainly focusing on user preferences.  Foursquare venue data is not open data, but it is available for free (within certain limits) to those who sign up as developers.  More information about Foursquare can be found at https://foursquare.com/

In this study we use only a small fraction of the venue data attributes available, focusing mostly on venue category information.


### 2.1.1 Downloading Foursquare venue data

The overall data downloading process in short: for each postal code area, we call Foursquare's _explore_ functionality to get a list of venues in that postal code area.  The results are stored in a pandas DataFrame and then saved to a local file in CSV format for later analysis.

The explore functionality is called with the latitude and longitude of the postal code area center, and with a radius that is the square root of the postal code area size.  On average this radius is a bit on the long side, but this way we should be able to get all the venues.
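As a rough illustration of the radius rule above, the following sketch computes the query radius as the square root of the area and compares it with the radius of a circle that has the same area.  The function names and the assumption that the area is given in square metres are illustrative; they are not taken from the original download code.

```python
import math

def query_radius_m(area_m2):
    """Radius for the venue query: the square root of the area (assumed to be in m^2)."""
    return math.sqrt(area_m2)

def equal_area_circle_radius_m(area_m2):
    """Radius of a circle that has exactly the given area, for comparison."""
    return math.sqrt(area_m2 / math.pi)

area = 4_000_000  # hypothetical postal code area of 4 km^2
print(query_radius_m(area))                     # 2000.0
print(round(equal_area_circle_radius_m(area)))  # 1128
```

The square-root radius is larger than the equal-area circle radius by a factor of sqrt(pi), roughly 1.77, which is one way to see why the radius is on the long side.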

Foursquare's _explore_ functionality limits the number of returned results to a maximum of 50 venues.  So, if the result set for any postal code area contains the maximum of 50 venues, we might not have received all the venues that actually exist there.  In these situations, considering our target audience, we make three more download requests, this time setting the _section_ attribute to the values _food_, _drinks_ and _cafe_.  This way we attempt to get at least most of the cafes and restaurants when the location has a lot of venues close to the coordinates.  Despite this extra attempt, it is not guaranteed that we get all the venues there are.
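The follow-up logic described above can be sketched as below.  This is only an outline under stated assumptions: `explore` is a hypothetical callable that wraps the actual API request and returns a list of venue dictionaries with an `"id"` key, and the section values are the ones named in the text; neither is copied from the original download code.

```python
LIMIT = 50  # maximum number of venues returned by one explore request (from the text above)
EXTRA_SECTIONS = ("food", "drinks", "cafe")  # section values as named in the text above

def download_area_venues(explore, lat, lon, radius):
    """Collect venues for one postal code area.

    `explore` is a hypothetical callable:
    explore(lat, lon, radius, section=None) -> list of dicts with an "id" key.
    """
    venues = {v["id"]: v for v in explore(lat, lon, radius)}
    if len(venues) >= LIMIT:
        # The result set is full, so some venues may be missing: ask again per section.
        for section in EXTRA_SECTIONS:
            for v in explore(lat, lon, radius, section=section):
                venues.setdefault(v["id"], v)  # keep each venue only once
    return list(venues.values())
```

Keying the results by venue id means a venue that is returned both by the general request and by a section request is stored only once.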

More information about Foursquare's _explore_ functionality (or endpoint, as they call it): https://developer.foursquare.com/docs/api/venues/explore

After downloading venue data for each postal code area in Finland, we had **36690 venue entries** in our CSV file.


### 2.1.2 Exploring Foursquare data content

From each received venue, we store just the following data:
- Postal code for which it was downloaded.
- Venue Id
- Venue
- Venue Latitude 
- Venue Longitude 
- **Venue Category**

Our analysis mainly uses the venue category information; the other collected data is kept mainly for informational and exploratory use.

The downloaded data contains 461 different venue categories.  From these we further identified the categories that relate to cafes or restaurants.

- Cafes:
    - select categories which contain words
        - coffee, cafe
    - we found the following venue categories relating to cafes:
        - Cafeteria, Coffee Shop, College Cafeteria
- Restaurants:
    - select categories which contain words
        - pizza, restaurant, blini, breakfast, buffet, burger, burrito, diner, food truck, fried chicken, noodle house, sandwich, steakhouse, taco, wings joint
    - we found the following venue categories (in total 79 categories) relating to restaurants:
        - Afghan Restaurant, African Restaurant, American Restaurant, Asian Restaurant, Australian Restaurant, Austrian Restaurant, Bed & Breakfast, Belgian Restaurant, Blini House,
          Brazilian Restaurant, Breakfast Spot, Buffet, Burger Joint, Burrito Place, ..., Sushi Restaurant, Szechuan Restaurant, Taco Place, Tapas Restaurant, Thai Restaurant, Theme Restaurant,
          Tibetan Restaurant, Turkish Restaurant, Vegetarian / Vegan Restaurant, Venezuelan Restaurant, Vietnamese Restaurant, Wings Joint
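The keyword-based selection listed above can be sketched as a simple case-insensitive substring test.  The function name and the sample category list are illustrative assumptions; the original selection code is not shown in this report.

```python
CAFE_WORDS = ["coffee", "cafe"]  # keywords listed above for cafes

def select_categories(categories, words):
    """Return the categories whose lower-cased name contains any of the words."""
    return [c for c in categories if any(w in c.lower() for w in words)]

sample = ["Bakery", "Cafeteria", "Coffee Shop", "College Cafeteria", "Beer Bar"]
print(select_categories(sample, CAFE_WORDS))
# ['Cafeteria', 'Coffee Shop', 'College Cafeteria']
```

Note that with this plain substring test an accented name such as `Café` does not contain `cafe`; whether the original selection normalised accents is not stated here.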

Finally, the following code snippet loads the data and shows us a sample of it.

In [16]:
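import pandas as pd

FS_DATA_FILENAME = "FourSquare_downloaded_venues_new.csv"

# Read the postal code column ("PC") as text rather than as a number.
print("Reading venues from file")
fs_venue_df = pd.read_csv(FS_DATA_FILENAME, dtype={"PC": 'str'})

# Show the number of rows and columns, and the first few rows.
print(fs_venue_df.shape)
fs_venue_df.head()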


Reading venues from file
(36690, 8)


Unnamed: 0,PC,PC Latitude,PC Longitude,Venue Id,Venue,Venue Latitude,Venue Longitude,Venue Category
0,100,60.172207,24.92929,4adcdb1ff964a5208b5f21e3,Konditoria Café Briossi,60.16732,24.938287,Bakery
1,100,60.172207,24.92929,4adcdb1ff964a5208d5f21e3,Fazer Café,60.168481,24.947506,Café
2,100,60.172207,24.92929,4adcdb1ff964a520a65f21e3,Café Strindberg,60.16769,24.946243,Café
3,100,60.172207,24.92929,4adcdb20f964a520ce5f21e3,KuuKuu,60.1754,24.92519,Scandinavian Restaurant
4,100,60.172207,24.92929,4adcdb20f964a520cf5f21e3,St. Urho's Pub,60.17397,24.9315,Beer Bar


### 2.1.3 Notes on Foursquare data

1. The **effect of the radius value on the explore result set**.  As the radius is slightly on the long side, we end up downloading some venues for more than one postal code area.  This is not considered a problem for this study; it is just something to be aware of.  The logic is that if a venue is on the border of two postal code areas, it serves the people and businesses of both areas, and because of this proximity it is acceptable to include it in both.  The same rule applies to all postal codes, so it does not favor any particular area.

2. Note 1 above is somewhat inevitable, as the **shapes of postal code areas on the map vary a lot**: they can be, for example, long and narrow, circular, square, or shaped like the letter 'P'.  With this data it is not easily feasible to find out a venue's true home postal code area.

3. **Data coverage and timeliness**: It is uncertain how well the Foursquare venue data matches the actual venues in each location.  Some venues, like cafes, may have gone out of business, and new cafes may have opened that are not yet visible in the Foursquare data.  As we have no way of knowing, we have to assume that the coverage is good and reasonably up to date.

## 2.2 Paavo data

---

Paavo data is open data by postal code area.  It is published and maintained by Statistics Finland, and a new updated version is published every January.

- More information about Statistics Finland can be found at http://www.tilastokeskus.fi/index_en.html
- The official description of Paavo data can be found at http://tilastokeskus.fi/tup/paavo/paavo_kuvaus_en.pdf

In short, Paavo contains all the 3026 postal code areas in Finland, and for each postal code area it describes statistical information in 103 data columns. The statistical data consists of variables in eight data groups. 

1. Population Structure (24 variables) HE
2. Educational Structure (7 variables) KO
3. Inhabitants' Disposable Monetary Income (7 variables) HR
4. Size and Stage in Life of Households (15 variables) TE
5. Households' Disposable Monetary Income (7 variables) TR
6. Buildings and Dwellings (8 variables) RA
7. Workplace Structure (26 variables) TP
8. Main Type of Activity (9 variables) PT

Each data group has a two letter code (for example 'HE' for the population structure) that is mentioned in the names of all variables belonging to that data group.  Additionally, the database contains the following identification data: postal code, name of the postal code area, coordinates (X and Y) and municipality code. 
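As a small illustration of the naming rule above, the data group of a variable can be read from its name.  The sketch assumes that the two-letter code appears in parentheses at the end of the name, as in the variable names shown in the tables of this report; the function name and the second example name are hypothetical.

```python
import re

# Two capital letters in parentheses at the very end of the name, e.g. "(HE)".
GROUP_CODE = re.compile(r"\(([A-Z]{2})\)\s*$")

def data_group(variable_name):
    """Return the two-letter data group code at the end of a variable name, or None."""
    match = GROUP_CODE.search(variable_name)
    return match.group(1) if match else None

print(data_group("Inhabitants, total, 2017 (HE)"))  # HE
print(data_group("Postal code area"))               # None
```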

Paavo data is explored in more detail below.


### 2.2.1 Downloading Paavo data

Paavo data is available for download through a couple of different methods.  For this study the 'Graphical User Interface' method was used, downloading the data from here: http://pxnet2.stat.fi/PXWeb/pxweb/en/Postinumeroalueittainen_avoin_tieto/?rxid=4e21d676-5dd1-4575-ab30-d35b741089d4


### 2.2.2 Exploring Paavo Data content

The following gives a brief overview of what kind of data Paavo actually provides.  It shows the available variables as rows, and variable values for three locations:

1. the whole of Finland (country totals)
2. Postal code area 00100, which is downtown Helsinki, Finland's capital
3. Postal code area 89840, which is a somewhat rural area in eastern Finland (Suomussalmi)

In [4]:
import os
import pandas as pd

# First load the data (tab-separated, ISO-8859-1 encoded).

PAAVO_FILENAME = 'paavo_9_koko_en_tab.csv'
if not os.path.isfile(PAAVO_FILENAME):
    # Stop here: the cells below cannot work without the data.
    raise FileNotFoundError("Did not find data file: " + PAAVO_FILENAME)

paavo_df = pd.read_csv(PAAVO_FILENAME, sep='\t', encoding='iso-8859-1')
print("Loaded Paavo data.\nFound {} rows and {} columns of data.".format(paavo_df.shape[0], paavo_df.shape[1]))

# Create a subset dataframe to inspect data.  In the transposed dataframe:
#    - Column 0 is for whole Finland,
#    - Column 1 is for postal code 00100 (center of Finland's capital)
#    - Column 2600 is for postal code 89840 (very rural area)
#
paavo_fin_df = paavo_df.T[[0, 1, 2600]]
paavo_fin_df.columns = ["Whole Finland", paavo_df.iloc[1,0], paavo_df.iloc[2600,0]]


Loaded Paavo data.
Found 3027 rows and 105 columns of data.


#### 2.2.2.1 Population Structure (24 variables) HE

In [5]:
paavo_fin_df[4:28]

Unnamed: 0,Whole Finland,00100 Helsinki Keskusta - Etu-Töölö (Helsinki,89840 Ylä-Vuokki (Suomussalmi )
"Inhabitants, total, 2017 (HE)",5513130,18284,57
"Females, 2017 (HE)",2793999,9613,20
"Males, 2017 (HE)",2719131,8671,37
"Average age of inhabitants, 2017 (HE)",42,41,63
"0-2 years, 2017 (HE)",160297,434,0
"3-6 years, 2017 (HE)",240994,521,0
"7-12 years, 2017 (HE)",369950,711,0
"13-15 years, 2017 (HE)",177163,274,0
"16-17 years, 2017 (HE)",117857,185,0
"18-19 years, 2017 (HE)",120218,264,1


#### 2.2.2.2 Educational Structure (7 variables) KO

In [6]:
paavo_fin_df[28:35]

Unnamed: 0,Whole Finland,00100 Helsinki Keskusta - Etu-Töölö (Helsinki,89840 Ylä-Vuokki (Suomussalmi )
"Aged 18 or over, total, 2017 (KO)",4446869,16159,57
"Basic level studies, 2017 (KO)",1112261,1996,30
"With education, total, 2017 (KO)",3334608,14163,27
"Matriculation examination, 2017 (KO)",303230,2618,1
"Vocational diploma, 2017 (KO)",2035528,2942,24
"Academic degree - Lower level university degree, 2017 (KO)",518969,2899,2
"Academic degree - Higher level university degree, 2017 (KO)",476881,5704,0


#### 2.2.2.3 Inhabitants' Disposable Monetary Income (7 variables) HR


In [7]:
paavo_fin_df[35:42]

Unnamed: 0,Whole Finland,00100 Helsinki Keskusta - Etu-Töölö (Helsinki,89840 Ylä-Vuokki (Suomussalmi )
"Aged 18 or over, total, 2016 (HR)",4431392,15935,60
"Average income of inhabitants, 2016 (HR)",23812,38985,16166
"Median income of inhabitants, 2016 (HR)",20925,26642,14939
"Inhabintants belonging to the lowest income category, 2016 (HR)",886431,2856,26
"Inhabitants belonging to the middle income category, 2016 (HR)",2658687,6668,31
"Inhabintants belonging to the highest income category, 2016 (HR)",886274,6411,3
"Accumulated purchasing power of inhabitants, 2016 (HR)",105520349469,621218859,969978


#### 2.2.2.4 Size and Stage in Life of Households (15 variables) TE


In [9]:
paavo_fin_df[42:57]

Unnamed: 0,Whole Finland,00100 Helsinki Keskusta - Etu-Töölö (Helsinki,89840 Ylä-Vuokki (Suomussalmi )
"Households, total, 2017 (TE)",2680077.0,10205.0,34.0
"Average size of households, 2017 (TE)",2.0,1.8,1.7
"Occupancy rate, 2017 (TE)",40.5,38.6,56.4
"Young single persons, 2017 (TE)",291052.0,2101.0,1.0
"Young couples without children, 2017 (TE)",115168.0,861.0,0.0
"Households with children, 2017 (TE)",570112.0,1326.0,0.0
"Households with small children, 2017 (TE)",142781.0,400.0,0.0
"Households with children under school age, 2017 (TE)",278849.0,715.0,0.0
"Households with school-age children, 2017 (TE)",263490.0,541.0,0.0
"Households with teenagers, 2017 (TE)",221106.0,373.0,0.0


#### 2.2.2.5 Households' Disposable Monetary Income (7 variables) TR


In [10]:
paavo_fin_df[57:64]

Unnamed: 0,Whole Finland,00100 Helsinki Keskusta - Etu-Töölö (Helsinki,89840 Ylä-Vuokki (Suomussalmi )
"Households, total, 2016 (TR)",2654657,10042,36
"Average income of households, 2016 (TR)",39270,61679,26975
"Median income of households, 2016 (TR)",31824,38895,23598
"Households belonging to the lowest income category, 2016 (TR)",677223,1697,13
"Households belonging to the middle income category, 2016 (TR)",1500917,4123,22
"Households belonging to the highest income category, 2016 (TR)",476517,4222,1
"Accumulated purchasing power of households, 2016 (TR)",104247634221,619383515,971110


#### 2.2.2.6 Buildings and Dwellings (8 variables) RA


In [11]:
paavo_fin_df[64:72]

Unnamed: 0,Whole Finland,00100 Helsinki Keskusta - Etu-Töölö (Helsinki,89840 Ylä-Vuokki (Suomussalmi )
"Free-time residences, 2017 (RA)",507200.0,0.0,103.0
"Buildings, total, 2017 (RA)",1523196.0,634.0,90.0
"Other buildings, 2017 (RA)",228770.0,326.0,14.0
"Residential buildings, 2017 (RA)",1294426.0,308.0,76.0
"Dwellings, 2017 (RA)",2946814.0,11884.0,48.0
"Average floor area, 2017 (RA)",80.1,65.9,97.7
"Dwellings in small houses, 2017 (RA)",1568029.0,2.0,48.0
"Dwellings in blocks of flats, 2017 (RA)",1378785.0,11882.0,0.0


#### 2.2.2.7 Workplace Structure (26 variables) TP


In [12]:
paavo_fin_df[72:98]

Unnamed: 0,Whole Finland,00100 Helsinki Keskusta - Etu-Töölö (Helsinki,89840 Ylä-Vuokki (Suomussalmi )
"Workplaces, 2016 (TP)",2094313,48470,16
"Primary production, 2016 (TP)",56104,104,0
"Processing, 2016 (TP)",461153,1805,0
"Services, 2016 (TP)",1576997,46560,16
"A Agriculture, forestry and fishing, 2016 (TP)",56104,104,0
"B Mining and quarrying, 2016 (TP)",5283,0,0
"C Manufacturing, 2016 (TP)",283209,752,0
"D Electricity, gas, steam and air conditioning supply, 2016 (TP)",11714,554,0
"E Water supply; sewerage, waste management and remediation activities, 2016 (TP)",10703,1,0
"F Construction, 2016 (TP)",150244,498,0


#### 2.2.2.8 Main Type of Activity (9 variables) PT


In [13]:
paavo_fin_df[98:]

Unnamed: 0,Whole Finland,00100 Helsinki Keskusta - Etu-Töölö (Helsinki,89840 Ylä-Vuokki (Suomussalmi )
"Inhabitants, 2016 (PT)",5503297,18035,61
"Employed, 2016 (PT)",2275679,10032,13
"Unemployed, 2016 (PT)",355837,856,6
"Children aged 0 to 14, 2016 (PT)",894178,1812,0
"Students, 2016 (PT)",407905,1198,1
"Pensioners, 2016 (PT)",1389830,3326,40
"Others, 2016 (PT)",179868,811,1


### 2.2.3 Notes on Paavo dataset

Note the following about Paavo data:

1. **Time scale:** Some variable values are based on year 2016 data, some on year 2017 data.  This is not a major issue for this study, but it may affect the results slightly.
2. Postal code **location information** is not given as latitude and longitude, but as X and Y values.  This is effectively an alternative coordinate system, and the values need to be converted to latitude and longitude before we can combine this data with Foursquare data.
3. The coordinate values (X and Y) point to the **center of the postal code area**.  In some cases, this center location is not where one would intuitively expect.  For example, the X and Y of a city center postal code area may actually fall on a lake or some other unexpected location, due to the shape of the postal code area.  This is somewhat compensated for by the radius value when looking for Foursquare venue data for that postal code area.
4. **Small value filtering:** Some postal code areas are rather small in population or other metrics, and if the total is less than 30, Paavo data does not contain the details for those areas.  This is something that we have to take care of when preparing the data.  After removing the postal codes with too few samples, we are left with 2108 postal code areas.
5. **This study focuses on the top 20 cities in Finland plus one smaller city as a bonus: Savonlinna**, which is a midsized city that is busy with vacationers in summer, but not so busy at other times of the year.  After selecting only those postal code areas that relate to these chosen cities, we have 677 postal code areas to work with.

## 2.3 Notes on combining the Foursquare and Paavo data

---

1. In this study, Paavo data is the primary data, which is then used to get the respective Foursquare data.  Venue data is used to augment the postal code area statistics.  These data are joined together by location data.

2. Only venue category data is combined with Paavo data.  That is, for each postal code area we add information about how many venues of each type there are, for example 2 cafes and 5 bus stops.

3. Time span of data: for Paavo statistics, we know the year for each variable, but for Foursquare venue data we cannot be certain how it matches the statistics.  For example, was the venue present in the location at the time of the statistics?  Since we cannot know for sure, we need to assume that in general the time spans match _well enough_, without being able to specify what is well enough.

4. To make it easier to focus on our research question, we create additional summaries of cafe venues and restaurant venues.
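The per-area venue counts described in note 2 can be sketched with plain Python counters.  The function name, the input shape and the sample rows (with made-up postal code labels) are assumptions for illustration only; they are not the original notebook code or data.

```python
from collections import Counter, defaultdict

def venue_counts_by_area(venue_rows):
    """Count venues per category for each postal code area.

    `venue_rows` is an iterable of (postal_code, venue_category) pairs.
    """
    counts = defaultdict(Counter)
    for postal_code, category in venue_rows:
        counts[postal_code][category] += 1
    return counts

# Hypothetical rows, not taken from the downloaded data.
rows = [("PC1", "Café"), ("PC1", "Café"), ("PC1", "Bus Stop"), ("PC2", "Bakery")]
counts = venue_counts_by_area(rows)
print(counts["PC1"]["Café"])      # 2
print(counts["PC1"]["Bus Stop"])  # 1
```

Each per-area counter can then be attached to the matching row of postal code statistics, one column per venue category.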


# 3. Methodology

---


## 3.1 Exploratory venue data analysis and confirmation

**Part 1 - Reasoning for analysis**: Before further analysis of the combined data, I first checked whether the Foursquare venue data was representative enough, mainly that there were no apparent gaps.  Such gaps could be present if the downloading process, which is essentially a data selection process, were somehow biased or faulty.  To be precise, I wanted to check that the values used for the _radius_ and _limit_ parameters of the Foursquare _explore_ endpoint did not cause any such problems.

**Part 1 - Process**: The _radius_ parameter was checked by comparing the radius used for the call to the distance of the venues from the postal code area center.  Their relative size was then plotted over all postal code areas.  This was also compared to the count of venues per postal code area, to see if venue distance was somehow connected to venue count.
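One way to compute the quantities described above is sketched below.  It assumes the great-circle (haversine) distance and a mean Earth radius of 6 371 km; the original notebook may have measured distance differently, and the function names and the 1000 m radius in the usage lines are hypothetical.  The coordinates in the usage lines come from the venue sample shown in section 2.1.2.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, an approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (latitude, longitude) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def relative_distances(center, venues, radius_m):
    """Min, mean and max venue distance from the center, as fractions of the radius."""
    d = [haversine_m(center[0], center[1], lat, lon) / radius_m for lat, lon in venues]
    return min(d), sum(d) / len(d), max(d)

center = (60.172207, 24.92929)                         # postal code area center from the sample
venues = [(60.168481, 24.947506), (60.1754, 24.92519)]  # two venues from the sample
print(relative_distances(center, venues, radius_m=1000))  # radius value is hypothetical
```

A relative distance above 1 means the venue lies further from the center than the radius used in the request.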

`Analysis_1_figure_1.png:`

<img src="Analysis_1_figure_1.png" style="">

**Part 1 - Result:** The upper plot shows that the venue results from the _explore_ endpoint are not harmfully restricted by the radius parameter of the Foursquare request.  That is, despite the radius value, many postal code areas have venues that are mostly further away than the radius used in the call.  Even the closest venue may be further away than the calculated radius for the postal code area, as can be seen from the blue area on the left side of the upper plot.

The lower plot shows that the number of returned venues varies regardless of the average venue distance from the postal code area center.  It would be a bad sign if all the areas with lots of venues were on the right side of the plot, where the venues are on average closer than the radius: this could mean that there were more venues to download, but the limit of 50 blocked us from getting them all.

In addition to the plots, a correlation analysis shows a strong correlation between the min, mean and max venue distances over the postal code areas, but no correlation between these and the venue count per postal code area.  This agrees with the plots.

To summarize the results in the above plots, the download choices that were made did not create any observable bias.  Thus the venue data seems to cover the postal code areas in a representative manner.



**Part 2 - Reasoning for analysis**: The _limit_ parameter was checked by measuring the amount of venue overlap between postal code area results.  That is, if a venue is returned for at least two postal code areas, then there is overlap on that venue and between those postal code areas.  This is to be expected, because the shapes of postal code areas on the map are quite irregular, and we have to approximate the queries with the center of the postal code area and a straight-line radius.  In areas where there are many venues, it may happen that the limit is reached before all venues are received.  Thus, overlap is a good sign that all venues have been received.  This analysis was done from two viewpoints: that of the venue and that of the postal code area.

**Part 2.1 - Process**: The venue viewpoint starts by identifying, for each venue, how many postal code areas mentioned it.  Then the venues were grouped based on this count: venues mentioned by only one postal code area, venues mentioned by two postal code areas, venues mentioned by three postal code areas, and so on.  Then a pie chart was created based on the relative sizes of these groups.
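The grouping step described above can be sketched as follows.  The function name, the input shape and the toy rows are assumptions for illustration; they are not the original analysis code or data.

```python
from collections import Counter

def overlap_groups(venue_rows):
    """Map 'number of postal code areas mentioning a venue' to 'number of such venues'.

    `venue_rows` is an iterable of (postal_code, venue_id) pairs.
    """
    # set() removes repeated (postal_code, venue_id) pairs before counting.
    areas_per_venue = Counter(venue_id for _, venue_id in set(venue_rows))
    return Counter(areas_per_venue.values())

# Toy rows: v1 is seen from two areas, v2 from one, v3 from three.
rows = [("A", "v1"), ("B", "v1"), ("A", "v2"), ("C", "v3"), ("B", "v3"), ("A", "v3")]
groups = overlap_groups(rows)
print(dict(sorted(groups.items())))  # {1: 1, 2: 1, 3: 1}
```

The values of the resulting mapping are the group sizes that a pie chart would show.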

`Analysis_1_figure_2.png`

<img src="Analysis_1_figure_2.png" style="">

**Part 2.1 - Result:** The above analysis shows us that:

- in total there are 17 702 distinct venues in our dataset
- less than half of the venues in the dataset are mentioned by only one postal code area.  That is, more than half of the venues are mentioned by at least two postal code areas.

This is a good sign: it suggests that our chosen venue gathering parameters (mainly the radius, and the extra requests to overcome the result limit) were successful in providing a representative result for the postal code areas, and that likely not many venues were missed.


**Part 2.2 - Process**: The postal code area viewpoint continues from this by checking, for each postal code area, how many of its venues are shared with at least one other postal code area.  This was plotted with the postal code areas in descending order of their total number of venues.  Alongside this, another plot was created showing, for each postal code area, how many connections its most highly connected venue has.  Here _connection_ means how many postal code areas refer to the venue.  These two plots are shown together to reveal any correlation between these two aspects.
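The per-area figures described above can be computed along these lines.  Again this is a sketch with a hypothetical function name and toy rows, not the original analysis code.

```python
from collections import Counter, defaultdict

def area_overlap(venue_rows):
    """For each postal code area return a tuple:
    (total venues, venues shared with at least one other area,
     connection count of its most connected venue).
    """
    pairs = set(venue_rows)  # (postal_code, venue_id), duplicates removed
    connections = Counter(venue_id for _, venue_id in pairs)
    by_area = defaultdict(list)
    for postal_code, venue_id in pairs:
        by_area[postal_code].append(connections[venue_id])
    return {
        pc: (len(c), sum(1 for n in c if n > 1), max(c))
        for pc, c in by_area.items()
    }

rows = [("A", "v1"), ("B", "v1"), ("A", "v2"), ("C", "v3"), ("B", "v3"), ("A", "v3")]
result = area_overlap(rows)
print(result["A"])  # (3, 2, 3)
print(result["C"])  # (1, 1, 3)
```

Sorting the areas by the first tuple element in descending order gives the x-axis order used for both plots.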

`Analysis_1_figure_3.png:`

<img src="Analysis_1_figure_3.png" style="">

**Part 2.2 - Result:** The above plots tell us how much the venue data overlaps across postal code area borders.  In the upper plot the black line shows the number of venues in each postal code area, with the postal code areas sorted in descending order of total venues.  The blue area below the black line shows how many of the venues of that postal code area are shared with at least one other postal code area.  A growing white area below the black line means that there are venues that exist only for that postal code area.  As we can see, the area below the black line is mainly blue, meaning that most postal code areas share their venues with other postal code areas.  This tells us that quite likely we have received all the venues that there are to receive.

The lower plot, in red, shows for each postal code area the connection count of its most shared venue.  The postal code areas are in the same order in both plots.  This plot varies quite steadily between 2 and 7 from left to right, showing only a slight decreasing tendency.  This is good, as it shows that the connectedness of a postal code area to its neighboring postal code areas does not depend very strongly on its total venue count.


**Result for parts 1 and 2**

As a summary of investigating the Foursquare venue data, we can conclude that we have managed to download a good sample of the data, if not absolutely all of it.  This is good to know, because matching together the postal code area shapes, their exact borders and the Foursquare venue data API required some assumptions, which left open the question of whether we had harvested the data well enough.  We have.

## 3.2 Identify the top 20 cities

---

The top 20 cities of Finland were identified from Paavo data by their population size.  Together they cover about 53 % of the Finnish population.

The cities are, in order of population from biggest to smallest (with the bonus city Savonlinna listed last):

```
Helsinki, Espoo, Tampere, Vantaa, Oulu, Turku, Jyväskylä, Kuopio, Lahti, Pori,
Kouvola, Joensuu, Lappeenranta, Vaasa, Hämeenlinna, Seinäjoki, Rovaniemi, Mikkeli, Kotka, Salo,
Savonlinna
```


## 3.3 Cluster postal code areas of the top 20 cities

---

Using the KMeans clustering algorithm, the postal code areas of the top 20 cities were clustered.  First I ran a test to see what cluster sizes come out of different values of _k_.

`Analysis_3_figure_1.png:`

<img src="Analysis_3_figure_1.png">

Above plot shows the clustering results by showing the sizes of the resulting clusters with a given _k_.  For example, when _k_ = 2, the biggest cluster has 536 postal code areas and the second biggest cluster has 141 postal code areas.  And when *k = 7*, the sizes of the clusters are (in order from biggest cluster to smallest cluster): 314, 223, 106, 21, 10, 2 and 1.  We also note that when *k* > 5, at least the smallest cluster is always of size one (1).

Then a value for _k_ was chosen so that even the smallest cluster contains more than one postal code area.  Thus I chose *k = 5*.  This gives us the following cluster sizes: 320, 204, 71, 63 and 19.
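The selection rule above can be written down as a small function.  This is a sketch: the function name is hypothetical, and only the cluster sizes quoted in the text are used as input (the sizes for other values of _k_ are not listed in this report).

```python
def largest_k_without_singletons(sizes_by_k):
    """Return the largest k whose smallest cluster has more than one member."""
    valid = [k for k, sizes in sizes_by_k.items() if min(sizes) > 1]
    return max(valid) if valid else None

# Cluster sizes quoted in the text above; other values of k are omitted here.
sizes_by_k = {
    2: [536, 141],
    5: [320, 204, 71, 63, 19],
    7: [314, 223, 106, 21, 10, 2, 1],
}
print(largest_k_without_singletons(sizes_by_k))  # 5
```

As a consistency check, each of the three size lists sums to 677, the number of postal code areas selected in section 2.2.3.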

## 3.4 Summary of the postal code area clusters

---

A summary of the postal code area clusters was created to identify the key metrics and properties of each cluster.  The cluster-level summaries were created by grouping the postal code areas by cluster and then aggregating the data values by summing or averaging, depending on the data.  A few extra relational metrics were added to the data to help in identifying the key differences between clusters.
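The aggregation described above could look roughly like the sketch below.  The dictionary keys (`cluster`, `population`, `area_km2`, `cafes`), the output names and the function name are hypothetical; the actual summary uses the Paavo variables and the venue counts.

```python
from collections import defaultdict

def summarize_clusters(areas):
    """Aggregate postal code areas per cluster.

    Each area is a dict with the hypothetical keys
    'cluster', 'population', 'area_km2' and 'cafes'.
    """
    grouped = defaultdict(list)
    for a in areas:
        grouped[a["cluster"]].append(a)
    summary = {}
    for cluster, rows in grouped.items():
        population = sum(r["population"] for r in rows)  # counts are summed
        area_km2 = sum(r["area_km2"] for r in rows)
        summary[cluster] = {
            "pc_count": len(rows),
            "avg_population": population / len(rows),    # mean per postal code area
            "avg_cafes": sum(r["cafes"] for r in rows) / len(rows),
            "inhabitants_per_km2": population / area_km2,  # extra relational metric
        }
    return summary
```

Counts such as inhabitants are summed first and divided afterwards, so that the density is the cluster's total population over its total area rather than a mean of per-area densities.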

`Screenshot_PC_clusters_1.JPG`

<img src="Screenshot_PC_clusters_1.JPG">

The above screenshot shows the basic summary information for each cluster.  A few notes:

- Cluster 0 (the topmost row in the above table) contains the postal code areas that were not included in the clustering, because they do not belong to the top 20 cities of Finland.

- **cluster 1** postal code areas have a much larger average surface area than those of the other clusters.  Their average inhabitant count per postal code area is also clearly the smallest.  The average number of inhabitants per square km is by far the smallest of all, by a factor of about 60.

- The average number of cafes per postal code area grows clearly as the clusters get smaller (see the *PC cluster count* column).  The same goes for the average number of restaurants and of venues in total per postal code area.

- **cluster 3** postal code areas are on average the second smallest in area, but the biggest in population, and they are also the most densely populated.  Still, the cafe, restaurant and total venue offerings are only about 1/3 of what clusters 4 and 5 offer.  So there are many people, but not so many cafes, restaurants or venues in these postal code areas.

- **clusters 2 and 4** are somewhat similar if one considers the average area per postal code, the average population per postal code, and thus the average population per square km.  However, they differ in their venue offerings: cluster 4 has about 12 times more cafes on average, and about 7 times more restaurants, than cluster 2 postal codes.  Cluster 4 contains in total the most cafes, restaurants and venues of all clusters, even though it is the second smallest cluster (as measured by the number of postal code areas in the cluster).

- **cluster 5** postal code areas are the smallest on average, and they are very densely populated.  They have the most venue offerings per population.

Based on the above observations, cluster 1 postal code areas appear to be quite rural, whereas the other clusters appear to be at least some sort of population centers.  Clusters 3-5 appear to be city areas, where clusters 4 and 5 are more like city centers, and cluster 3 is perhaps the neighborhood where most people live, just a little outside the centers.  This analysis is continued later in this study.


## 3.5 Summary of the cities

---

A summary of the cities was created so that the cities themselves could be clustered directly.  The city-level summaries were created by grouping the postal code areas by city name and then aggregating the data values by summing or averaging, depending on the data, as was done above for the clusters.

```Screenshot_City_top20_summary.PNG: basic info of the biggest cities.```

<img src="Screenshot_City_top20_summary.PNG">

Above table shows the cities in decreasing order by population.

More of these results will be presented in section 3.8, where the summaries are enhanced with information on how city postal code areas split between different postal code clusters.

## 3.6 Clustering cities

---

Cities were clustered twice.  First only the top 20 cities (as identified before) were clustered, and then all the cities (and counties) were clustered to see how much more diverse the results would be.  Cluster sizes for both cases were plotted for exploratory analysis of the best value of _k_.

`Analysis_6_figure_1.png: clustering the top 20 cities`

<img src="Analysis_6_figure_1.png">

`Analysis_6_figure_2.png: clustering all cities (and counties)`

<img src="Analysis_6_figure_2.png">

As the above plots show, clustering the cities did not provide very interesting clusters.  The biggest cluster tended to contain most of the cities, and increasing the value of parameter _k_ typically just produced additional clusters that were very small, mainly of size 1.

After these results the city clusters were not investigated further in this study.
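The behaviour described above, where one large cluster persists and extra clusters end up with a single member, can be reproduced on toy data.  The sketch below assumes that the clustering was k-means (the report only refers to the parameter k) and uses a small NumPy implementation with a deterministic farthest-point start rather than whatever library the notebook used.  The feature values are made up.

```python
import numpy as np


def kmeans_labels(points, k, n_iter=50):
    """Plain Lloyd's algorithm with deterministic farthest-point initialisation."""
    points = np.asarray(points, dtype=float)
    # Start from the first point, then repeatedly add the point that is
    # farthest from all centers chosen so far.
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    centers = np.array(centers)

    for _ in range(n_iter):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members (keep it if the cluster is empty).
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def cluster_sizes(points, k):
    """Cluster sizes for a given k, largest first."""
    labels = kmeans_labels(points, k)
    return sorted(np.bincount(labels, minlength=k).tolist(), reverse=True)


# Hypothetical scaled city-level features: seven similar cities plus two outliers.
cities = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.2], [1.2, 1.1], [1.0, 0.8],
          [0.8, 1.0], [1.1, 1.2], [9.0, 9.0], [5.0, 1.0]]

sizes_by_k = {k: cluster_sizes(cities, k) for k in (2, 3)}
```

On this toy data each increase in k only splits off one more single-member cluster, which is the same pattern the plots above show for the real cities.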


## 3.7 Analyse top 10 venues

---

The top 10 venues were analysed for 1) cities (city summary), 2) individual postal code areas and 3) postal code clusters (PC summary).  The top 10 venues help to identify key differences between rows in the data.  Here we focus only on the first and third of these.

```Screenshot_10_Most_Common_venues_per_city.PNG```

<img src="Screenshot_10_Most_Common_venues_per_city.PNG">




```Screenshot_10_Most_Common_venues_per_PC_cluster.PNG```

<img src="Screenshot_10_Most_Common_venues_per_PC_cluster.PNG">

The more interesting of these two is perhaps the latter, the most common venues of the postal code area clusters.  Here we can see that:

- **cluster 1** is centered on grocery stores and supermarkets, with varying types of venues after these two, including cafes and pizzerias.

- **cluster 2** is similar, but here bus stops dominate above all else.

- **clusters 4 and 5** are very cafe, restaurant and bar oriented in their top venues.  In cluster 4 the grocery store still makes the list as the 10th most common venue, but in cluster 5 it no longer does.  Instead, cluster 5 starts to have perhaps more refined venue offerings, like sushi restaurants, beer bars and gastropubs.

- **cluster 3** falls between the groups above, just as it does in cluster size.  Cluster 3 is more cafe, restaurant and bar oriented than clusters 1 and 2, but it still has both grocery stores and supermarkets, and its restaurant offerings lean towards pizza and fast food.

The findings above on the PC clusters' top 10 venues fit well with our findings in section 3.4 (Summary of the postal code area clusters).
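A ranking like the one in the screenshots above can be produced by counting venue categories per group and keeping the most frequent ones.  The following is a minimal sketch under assumptions: the input is taken to be one (cluster, venue category) pair per venue, and the sample pairs are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (PC cluster, venue category) pairs, one per venue.
venues = [
    (1, "Grocery Store"), (1, "Grocery Store"), (1, "Supermarket"), (1, "Cafe"),
    (5, "Cafe"), (5, "Cafe"), (5, "Cafe"), (5, "Sushi Restaurant"),
    (5, "Sushi Restaurant"), (5, "Beer Bar"),
]


def top_venues(pairs, n=10):
    """Return the n most common venue categories for each cluster."""
    counts = defaultdict(Counter)
    for cluster, category in pairs:
        counts[cluster][category] += 1
    return {
        cluster: [category for category, _ in counter.most_common(n)]
        for cluster, counter in counts.items()
    }


top = top_venues(venues, n=3)
```

The same function works for cities if the first element of each pair is a city name instead of a cluster label.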

## 3.8 PC cluster proportions in each city

---

For each city, we analyze how many of its postal code areas belong to each postal code cluster.  Combined with the understanding of the key features of the postal code clusters built up in sections 3.4 and 3.7, this helps us understand the cities better.

```Screenshot_PC_clusters_in_cities.PNG```

<img src="Screenshot_PC_clusters_in_cities.PNG">

The table above confirms our understanding of the PC (postal code area) clusters.  Namely,

- the proportion of cluster 1 grows as cities get smaller, meaning that smaller cities have proportionally more rural areas within their city limits than big cities.

- cluster 5 is present in only two of the very top cities.

- cluster 2 is typically the most common PC cluster in the top 8 cities, except in Kuopio.  Beyond the top 8 cities, cluster 1 takes over as the most common PC cluster, except in _Vaasa_, which is in position 13.

- the proportions of clusters 3 and 4 decrease steadily as city size decreases.  However, in the top 6 cities, clusters 3 and 4 combined would compete head-to-head with cluster 2 for the position of most common PC cluster.  This is somewhat hypothetical, but the postal code areas in these clusters are quite city-like (based on the previous analyses), so it is noteworthy.  The combined proportion of these two clusters decreases in smaller cities, but these areas still play the role of smaller town centers, at least within the top 20 cities.
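The proportions discussed above amount to a per-city count of postal code areas by cluster, divided by the city's total number of areas.  A minimal sketch, assuming one (city, cluster) pair per postal code area and using invented city names and labels:

```python
from collections import Counter, defaultdict

# Hypothetical (city, PC cluster) pairs, one per postal code area.
areas = [
    ("Big City", 2), ("Big City", 2), ("Big City", 3), ("Big City", 5),
    ("Small Town", 1), ("Small Town", 1), ("Small Town", 1), ("Small Town", 2),
]


def cluster_proportions(pairs, clusters=(1, 2, 3, 4, 5)):
    """Share of each city's postal code areas that falls in each PC cluster."""
    counts = defaultdict(Counter)
    for city, cluster in pairs:
        counts[city][cluster] += 1
    return {
        city: {c: counter[c] / sum(counter.values()) for c in clusters}
        for city, counter in counts.items()
    }


proportions = cluster_proportions(areas)
# The most common PC cluster of a city is the one with the largest share.
most_common = {city: max(shares, key=shares.get) for city, shares in proportions.items()}
```

Clusters that do not occur in a city get a share of 0.0, so every city has an entry for all five clusters and the rows can be compared directly.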

## 3.9 Correlations between cafes and other data

---

Next we ran a correlation analysis (Pearson coefficient) to find which data correlates best with the number of cafes per postal code area.  The purpose was to find the best matching data features, but also to pay attention to the ordering and correlation within certain data groups, such as the Paavo variable groups presented in section 2.2.2 (exploring Paavo data content).  The venue categories were treated as one such data group of their own.

Correlations were computed for the whole of Finland and within each PC cluster.  Below are some plots of these (more can be found in the workbook).

```Analysis_10_correlations_Cafés_and_age.png```

<img src="Analysis_10_correlations_Cafés_and_age.png">

The plot above shows how the Paavo data features on inhabitant age correlate with the number of cafes per postal code area.  The age group 25-29 years correlates best with the number of cafes; the correlation is 0.58, which is a moderate correlation.  A correlation of 0.7 or above could be considered strong.

The correlation computed from all postal codes ('All data') is among the highest in this comparison, as it is in all the comparisons, and in many cases it is the highest.  The clusters that come close are typically the smaller clusters, meaning they do not have very many data points to build on.  Because of this finding, all postal code areas were used in the next section for the linear regression analysis.


```Analysis_10_correlations_Cafés_and_jobs.png```

<img src="Analysis_10_correlations_Cafés_and_jobs.png">

Another example shows the correlations between cafes and different categories of jobs or employment.  The presence of service professions in general correlates well with the number of cafes.  An equally high correlation (0.57) is found with _R Arts, entertainment and recreation_ professions.  It is also interesting that _A Agriculture, forestry and fishing_ has a slightly negative correlation with the number of cafes in the region, which makes sense, even though the correlation is so close to zero that there is hardly any correlation at all.

Again, 'All data' gives the highest correlation peaks and is generally among the highest.


```Analysis_10_correlations_Cafés_and_venues.png```

<img src="Analysis_10_correlations_Cafés_and_venues.png">

Finally, the plot above is slightly different in appearance.  On the x axis are the venue categories that had at least one correlation test exceeding the 0.5 correlation limit in absolute value (some fell below -0.5).  For example, _bars_ have a strong correlation with cafes in the region, and so do _pubs, sushi restaurants, Italian restaurants, Scandinavian restaurants_ and _restaurants_ in general.


Again, the correlations with 'All data' score well in this comparison.
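For reference, the Pearson coefficient used throughout this section can be computed as sketched below.  This is a plain Python illustration, not the notebook's code; the feature names are shortened and all numbers are invented, with one value per postal code area in each list.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-area values; each list has one entry per postal code area.
cafes = [0, 1, 2, 4, 8]
features = {
    "25-29 years": [50, 120, 200, 410, 790],
    "Agriculture jobs": [30, 10, 25, 5, 20],
}

# Rank features by the absolute value of their correlation with cafe counts.
correlations = {name: pearson(values, cafes) for name, values in features.items()}
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
```

Running the same function on the rows of a single PC cluster, instead of on all rows, gives the per-cluster correlations that are compared against 'All data' in the plots.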

## 3.10 Linear regression analysis and predictions

---

Based on the previous correlation analysis, I chose the most promising features from both the Paavo dataset and the venue categories, and used them as features for machine learning.  The purpose was to train a model to predict the number of cafes in a postal code area.  With this information we could compare the prediction to the actual number of cafes in the area, and identify potential areas where there could be demand for more cafes.

Chosen features for training the linear regression model were:

- Bar
- Italian Restaurant
- Pub
- Restaurant
- Scandinavian Restaurant
- Sushi Restaurant
- 25-29 years, 2017 (HE)
- Matriculation examination, 2017 (KO)
- Academic degree - Higher level university degree, 2017 (KO)
- Inhabitants belonging to the highest income category, 2016 (HR)
- Young couples without children, 2017 (TE)
- Households belonging to the highest income category, 2016 (TR)
- Dwellings in blocks of flats, 2017 (RA)
- R Arts, entertainment and recreation, 2016 (TP)
- Services, 2016 (TP)
- Employed, 2016 (PT)

To test the model, all of the postal code areas were used with an 80% / 20% train and test split.  Training and testing were repeated 600 times, giving an average variance score of 0.83.  The exact score varied a little from run to run because the train and test data were split differently each time.

The average variance score seemed adequate, so next I _trained a model with all of the Paavo data_ (no more splitting into test and train data).  It had a variance score of 0.86.  Then I used the trained model to predict the number of cafes for all postal code areas, in order to identify the areas where the prediction exceeded the actual number by the largest margin.
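The repeated train and test procedure described above can be sketched as follows.  Several things here are assumptions: the model is fitted with ordinary least squares through NumPy rather than the library used in the notebook, the score is the coefficient of determination (R²) standing in for the variance score reported above, which may have been computed differently, and the data is synthetic.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    """Apply fitted coefficients (intercept first) to a feature matrix."""
    return np.column_stack([np.ones(len(X)), X]) @ coef


def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def repeated_split_score(X, y, n_repeats=600, test_share=0.2, seed=0):
    """Average test-set score over repeated random train/test splits."""
    rng = np.random.default_rng(seed)
    n_test = max(1, int(round(test_share * len(y))))
    scores = []
    for _ in range(n_repeats):
        order = rng.permutation(len(y))
        test, train = order[:n_test], order[n_test:]
        coef = fit_linear(X[train], y[train])
        scores.append(r2_score(y[test], predict(coef, X[test])))
    return float(np.mean(scores))


# Synthetic stand-in data: 200 areas, 3 features, noisy linear target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

average_score = repeated_split_score(X, y, n_repeats=50)
```

Averaging over many random splits smooths out the run-to-run variation in the score that a single 80% / 20% split would show.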



```Screenshot_Linear_Regression_cafe_predictions.PNG```

<img src="Screenshot_Linear_Regression_cafe_predictions.PNG">

The screenshot above shows the postal code areas where the linear regression model's prediction (_Predicted Cafes Total_) was at least 5 cafes higher than the actual number.
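The selection rule behind this table is a simple filter on the gap between predicted and actual counts.  A minimal sketch, with hypothetical field names (`postal_code`, `actual`, `predicted`) and invented numbers:

```python
def underserved_areas(rows, min_gap=5):
    """Areas where predicted cafes exceed actual cafes by at least min_gap."""
    flagged = [
        {**row, "gap": row["predicted"] - row["actual"]}
        for row in rows
        if row["predicted"] - row["actual"] >= min_gap
    ]
    # Largest gap first, so the most promising areas come at the top.
    return sorted(flagged, key=lambda row: row["gap"], reverse=True)


# Hypothetical postal codes and counts, for illustration only.
predictions = [
    {"postal_code": "00001", "actual": 2, "predicted": 9.4},
    {"postal_code": "00002", "actual": 6, "predicted": 7.1},
    {"postal_code": "00003", "actual": 0, "predicted": 5.0},
]

candidates = underserved_areas(predictions)
```

The threshold of 5 follows the text above; a relative threshold (for example, predicted at least twice the actual) would be another reasonable choice for small areas.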


The next map shows how the entries in the table above are spread across Finland and, when zooming in, how they are located near the Helsinki and Tampere areas.

```Screenshot_Cafe_potential_map_1.PNG```

<img src="Screenshot_Cafe_potential_map_1.PNG">



```Screenshot_Cafe_potential_map_2_Helsinki.PNG```

<img src="Screenshot_Cafe_potential_map_2_Helsinki.PNG">



```Screenshot_Cafe_potential_map_3_Tampere.PNG```

<img src="Screenshot_Cafe_potential_map_3_Tampere.PNG">



# 4. Results

---



Let's start with a quick recap of what we concluded in section 3 (Methodology):

1. **Venue data was checked** for coverage and for missing spots.  We couldn't identify any; the data spread seemed credible.

2. The **top 20 cities** were identified by population count.

3. Identified and **clustered the postal code areas** of the top 20 cities.  _K = 5_ was chosen after testing cluster sizes with different values for k.

4. **Summary of the postal code area clusters** (= PC clusters).  Here we started building our understanding of the similarities and differences between the PC clusters.

5. **Summary of the cities was created**.  At this stage it gave some summary numbers, but few interesting facts that would not already be available on the internet, for example on Wikipedia pages.

6. We attempted to **cluster the cities themselves**, based on the city-level summaries.  This was attempted on the top 20 cities alone and on all cities and counties (300+ in total), but neither gave us much to build on: the vast majority of cities remained in one huge cluster almost regardless of the value of k.

7. Analyzed the **top 10 venues** for each city and PC cluster.  The latter proved more useful in helping us understand the similarities and differences between PC clusters.

8. **PC cluster proportions** were calculated for each city.  By this point we had started to gain some understanding of the profile of these cities, based on how city areas are distributed over the PC clusters.

9. Then we started to look for **correlations between the number of cafe venues and other data features**.  Correlations were computed from all of the data as well as from each cluster alone.  It turned out that the best correlations were obtained by using all of the postal codes as data.  We identified features that correlate at least moderately with the number of cafes.

10. Finally, a linear regression model was built to predict how many cafes there would be in each postal code area.  This prediction was compared to actual data and this way we identified areas where the prediction was clearly higher than the actual amount.  These areas are potential new cafe business locations. 



In the introduction section this study set as its goal the following questions:

- Overall business problem: "_How do Finnish cities appear to someone willing to start a new cafe in Finland_".
- Part 1 of the business problem - overall understanding of Finnish cities: **What are the characteristics that differentiate the cities and how each city compares to each other in relation to these characteristics?**
- Part 2 of the business problem - in more detail any insight about cafe opportunities within these cities: **Which other factors correlate with the offering of these cafe services?**

Part 1 of the business problem was answered by sections 3.2 - 3.8 of the methodology section, whereas sections 3.9 and 3.10 contributed to answering part 2 of the business problem.


# 5. Discussion

---

In general, the features found to correlate with the number of cafes may not be surprising, but it is interesting that the data could provide and verify them.  For example, the number of young adults aged 25-29 (and more broadly, 20-30) correlates well with the number of cafes.  So do higher education, certain kinds of venues, such as sushi restaurants, and jobs in arts, entertainment and recreation.  These results are hardly surprising if one thinks of the trendy cafe boom of the last 20 years in Finland, especially in cities and urban contexts.  Still, the data clearly supports it.

The findings on the nature of the cities are hardly surprising either, for example how the proportions of the different kinds of postal code area clusters vary between the top 20 cities.  Still, the data backs up the view that the biggest cities differ from the rest in their balance of _very urban_ and _rural_ areas.  The top 3 cities have characteristics that others do not, like the presence of very dense city centers.  And cities outside the top 15 quickly start to look alike.

Recommendations based on the results: I believe that, based on the data we have used, the identified potential cafe locations are worth studying further by other means, such as visiting the sites.  There may be a good reason not to start a cafe business in these locations, but if so, it is likely due to factors that the datasets used here do not capture, such as environmental or other considerations.

About the methodology: more and different machine learning models could have been tried and compared against each other, both in how they perform and in what kind of results they find.

The biggest question mark in this kind of study is the data that was used.  I trust the Paavo data, which is somewhat limited but solid.  But I still find myself questioning the usefulness of the venue data from Foursquare.  It is clearly not accurate in absolute numbers; for example, there are many more cafes than have been entered into the service.  This may not be a problem if the data is still a good representation of what is out there.  But that is something we can only assume, since we do not know, for example, how balanced the Foursquare user group is.  If the majority of the users are young adults aged 20-30 years, that is likely to affect what kinds of venues, and how many, get entered into the service.  And do they enter only their favorite venues?  Thus, additional data sources would be an interesting way to take this kind of study further.

# 6. Conclusion

---


This study answered the question of how Finnish cities appear to someone willing to start a new cafe in Finland.  We had two major data sources: postal code area statistical data (Paavo) from Statistics Finland and venue data from Foursquare.  We used exploratory, statistical and machine learning techniques to answer the question.  Combining the data and the methodology, we were able to describe the characteristics of the top 20 cities in Finland, identify features to look for when founding a new cafe business, and propose locations to scout when looking for cafe business opportunities.