# Capstone Project Final Report

This notebook is the final project report for Coursera's _Applied Data Science Capstone_ course.

# 1. Introduction

---

This report studies Finland, a country in northern Europe, and more specifically its 20 biggest cities by population.  The overall question to answer is "_How do Finnish cities appear to someone looking to start a new cafe or restaurant in Finland?_"  The business problem is divided into two parts: first an overall understanding of Finnish cities, and then more detailed insight into cafe and restaurant opportunities within these cities.

The audience for this study is **anyone with an entrepreneurial spirit and the intention to start a cafe or restaurant business**.  This report is especially valuable to immigrants in Finland with such intentions, as they do not necessarily have a very good understanding of Finland to begin with.  However, it will also serve native Finnish people with the same intentions: people usually know only their nearby areas well, and in some cases more distant locations might provide more or better opportunities.

The first question, about the overall understanding of Finnish cities, can be phrased as: "**What are the characteristics that differentiate the cities, and how does each city compare to the others in relation to these characteristics?**"  For example, is population size the only differentiating factor between these cities, or are there other, perhaps even more substantial characteristics?  Are there characteristics that tell us something about the environment or the culture within these cities?  These characteristics can be used to identify potential needs in different areas, and help in profiling the new cafe or restaurant service and finding its best target audience.  For example, a cafe for university students and a cafe for construction workers may each benefit from understanding the needs of these different client groups.

The second, more detailed question is "**Which other factors correlate with the offering of these cafe or restaurant services?**"  For example, one assumption is that population should somehow correlate with the number of cafes in the neighborhood.  The question then is which factors are good indicators: do education, income or the presence of other services nearby affect the demand for and offering of cafes?  Once identified, these factors can be used to search for good business locations.  For example, if the identified factors suggest that some area should have X cafes, but it has only Y, that area is perhaps a good opportunity for business.


# 2. Data

---

This project uses two datasets: venue data provided by _Foursquare_, and a postal code area statistics dataset called Paavo, which is provided as open data by _Statistics Finland_.


## 2.1 Foursquare data

---

Foursquare provides venue data near a given location.  A venue can be anything a Foursquare user has entered into the service, like a cafe, restaurant, museum, park or a bus stop.  Foursquare provides a lot of data about venues, such as reviews, location and type (category).  Foursquare also provides information on the users who report and evaluate venues, mainly focusing on user preferences.  Foursquare venue data is not open data, but it is available for free (within certain limits) to those who sign up as developers.  More information about Foursquare can be found at https://foursquare.com/

In this study we use only a small fraction of the venue data attributes available, focusing mostly on venue category information.


### 2.1.1 Downloading Foursquare venue data

The overall data downloading process in short: for each postal code area, we call Foursquare's _explore_ functionality to get a list of venues in that postal code area.  The results are stored in a Pandas DataFrame and then saved into a local file in CSV format for later analysis.

The explore functionality is called with the latitude and longitude of the postal code area's center, and with a radius equal to the square root of the postal code area's surface area.  This radius is somewhat on the long side (a circle with the same area as the postal code area would have a radius of only about 0.56 times the square root of the area), but this way we should be able to get all the venues.

Foursquare's _explore_ functionality limits the number of returned results to a maximum of 50 venues.  So, if the result set for a postal code area contains the maximum of 50 venues, we might not have received all the venues that actually exist there.  In these situations, considering our target audience, we make three more download requests, this time setting the _section_ attribute to _food_, _drinks_ and _cafe_ respectively.  This way we attempt to get at least most of the cafes and restaurants when the location has a lot of venues close to the coordinates.  Despite this extra attempt, it is not guaranteed that we get all the venues.

More information about Foursquare's explore functionality (or endpoint, as they call it): https://developer.foursquare.com/docs/api/venues/explore
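The request logic described above can be sketched roughly as follows.  This is a minimal illustration, not the notebook's actual download code: it only builds the query parameters and makes no network calls.  The function names are hypothetical, the parameter names (`ll`, `radius`, `limit`, `section`) are those of Foursquare's v2 explore endpoint, the section values are taken from the text above and have not been checked against the API, and the area is assumed to be given in square metres.  Credentials and the version parameter are left out.

```python
import math

# The 50-venue cap and the three follow-up sections come from the text above.
FS_LIMIT = 50
EXTRA_SECTIONS = ("food", "drinks", "cafe")


def explore_radius_m(area_m2):
    """Search radius in metres: square root of the area (assumed to be in m^2)."""
    return math.sqrt(area_m2)


def explore_params(lat, lon, area_m2, section=None):
    """Build the query parameters for one explore request (credentials omitted)."""
    params = {
        "ll": f"{lat},{lon}",
        "radius": int(round(explore_radius_m(area_m2))),
        "limit": FS_LIMIT,
    }
    if section is not None:
        params["section"] = section
    return params


def follow_up_requests(lat, lon, area_m2, n_venues_returned):
    """Return the extra section-specific requests needed when the cap was hit."""
    if n_venues_returned < FS_LIMIT:
        return []
    return [explore_params(lat, lon, area_m2, s) for s in EXTRA_SECTIONS]


# Example: a hypothetical 4 km^2 area centred on the coordinates used for 00100
print(explore_params(60.172207, 24.92929, 4_000_000))
print(len(follow_up_requests(60.172207, 24.92929, 4_000_000, 50)))
```

Keeping the parameter construction separate from the HTTP call makes the "did we hit the limit?" rule easy to check on its own.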

After downloading venue data for each postal code area in Finland, we had **36690 venue entries** in our CSV file.


### 2.1.2 Exploring Foursquare data content

From each received venue, we store just the following data:
- Postal code for which it was downloaded.
- Venue Id
- Venue
- Venue Latitude 
- Venue Longitude 
- **Venue Category**

Our analysis mainly uses the venue category information; the other collected data is kept mainly for informational and exploratory purposes.

In the downloaded data there are 461 different venue categories.  Of these, we further identified the categories that relate to cafes or restaurants.

- Cafes:
    - select categories which contain the words
        - coffee
        - cafe
    - we found the following venue categories relating to cafes:
        - Cafeteria
        - Coffee Shop
        - College Cafeteria
- Restaurants:
    - select categories which contain the words
        - pizza
        - restaurant
        - blini
        - breakfast
        - buffet
        - burger
        - burrito
        - diner
        - food truck
        - fried chicken
        - noodle house
        - sandwich
        - steakhouse
        - taco
        - wings joint
    - we found the following venue categories (in total 79 categories) relating to restaurants:
        - Afghan Restaurant
        - African Restaurant
        - American Restaurant
        - Asian Restaurant
        - Australian Restaurant
        - Austrian Restaurant
        - Bed & Breakfast
        - Belgian Restaurant
        - Blini House
        - Brazilian Restaurant
        - Breakfast Spot
        - Buffet
        - Burger Joint
        - Burrito Place
        - ...
        - Sushi Restaurant
        - Szechuan Restaurant
        - Taco Place
        - Tapas Restaurant
        - Thai Restaurant
        - Theme Restaurant
        - Tibetan Restaurant
        - Turkish Restaurant
        - Vegetarian / Vegan Restaurant
        - Venezuelan Restaurant
        - Vietnamese Restaurant
        - Wings Joint
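
The keyword selection above can be sketched as a case-insensitive substring test.  This is only an illustration under that assumption; the function names are hypothetical, and the notebook's actual matching code is not shown in this report.

```python
# Keyword lists copied from the selection rules above
CAFE_WORDS = ("coffee", "cafe")
RESTAURANT_WORDS = (
    "pizza", "restaurant", "blini", "breakfast", "buffet", "burger",
    "burrito", "diner", "food truck", "fried chicken", "noodle house",
    "sandwich", "steakhouse", "taco", "wings joint",
)


def matches_any(category, words):
    """True if the lower-cased category name contains any of the words."""
    name = category.lower()
    return any(word in name for word in words)


def split_categories(categories):
    """Return (cafe categories, restaurant categories), each sorted by name."""
    cafes = sorted(c for c in categories if matches_any(c, CAFE_WORDS))
    restaurants = sorted(c for c in categories if matches_any(c, RESTAURANT_WORDS))
    return cafes, restaurants


# A few category names that appear in this report
sample = ["Cafeteria", "Coffee Shop", "College Cafeteria", "Bakery",
          "Beer Bar", "Scandinavian Restaurant", "Bed & Breakfast", "Taco Place"]
cafes, restaurants = split_categories(sample)
print(cafes)        # ['Cafeteria', 'Coffee Shop', 'College Cafeteria']
print(restaurants)  # ['Bed & Breakfast', 'Scandinavian Restaurant', 'Taco Place']
```

Note that a plain substring test like this does not match an accented name such as `Café`; whether the original selection normalised accents is not stated here.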

Finally, the following code snippet loads the data and shows us a sample of it.

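In [16]:
import pandas as pd

FS_DATA_FILENAME = "FourSquare_downloaded_venues_new.csv"

# Read the postal code column (PC) as a string rather than a number
print("Reading venues from file")
fs_venue_df = pd.read_csv(FS_DATA_FILENAME, dtype={"PC": 'str'})

# Show the size of the data and its first rows
print(fs_venue_df.shape)
fs_venue_df.head()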


Reading venues from file
(36690, 8)


|  | PC | PC Latitude | PC Longitude | Venue Id | Venue | Venue Latitude | Venue Longitude | Venue Category |
|---|---|---|---|---|---|---|---|---|
| 0 | 100 | 60.172207 | 24.92929 | 4adcdb1ff964a5208b5f21e3 | Konditoria Café Briossi | 60.16732 | 24.938287 | Bakery |
| 1 | 100 | 60.172207 | 24.92929 | 4adcdb1ff964a5208d5f21e3 | Fazer Café | 60.168481 | 24.947506 | Café |
| 2 | 100 | 60.172207 | 24.92929 | 4adcdb1ff964a520a65f21e3 | Café Strindberg | 60.16769 | 24.946243 | Café |
| 3 | 100 | 60.172207 | 24.92929 | 4adcdb20f964a520ce5f21e3 | KuuKuu | 60.1754 | 24.92519 | Scandinavian Restaurant |
| 4 | 100 | 60.172207 | 24.92929 | 4adcdb20f964a520cf5f21e3 | St. Urho's Pub | 60.17397 | 24.9315 | Beer Bar |


### 2.1.3 Notes on Foursquare data

1. The **effect of the radius value on the explore result set**.  As the radius is slightly on the long side, we do end up downloading some venues for more than one postal code area.  This is not considered a problem for this study; it is just something to be aware of.  The logic is that if a venue is on the border of two postal code areas, it serves the people and businesses of both areas, and because of this proximity it is acceptable to include it in both.  The same rule applies to all postal codes, so it does not benefit only some of them.

2. Note 1 above is actually somewhat inevitable, as the **postal code area shapes on the map vary a lot**: they can be, for example, long and narrow, circular, square, or shaped like a letter 'P'.  With this data it is not easily feasible to find out a venue's true home postal code area.

3. **Data coverage and timeliness**: It is uncertain how well the Foursquare venue data matches the actual venues in each location.  Some venues, such as cafes, may have gone out of business, and new cafes may have emerged that are not yet visible in the Foursquare data.  As we have no way of knowing, we simply assume that the coverage is good and reasonably up to date.

## 2.2 Paavo data

---

Paavo data is open data by postal code area.  It is published and maintained by Statistics Finland, and an updated version is published every January.

- More information about Statistics Finland can be found at http://www.tilastokeskus.fi/index_en.html
- The official description of Paavo data can be found at http://tilastokeskus.fi/tup/paavo/paavo_kuvaus_en.pdf

In short, Paavo contains all the 3026 postal code areas in Finland, and for each postal code area it provides statistical information in 103 data columns.  (The downloaded file used below has one additional row holding the totals for the whole country.)  The statistical data consists of variables in eight data groups:

1. Population Structure (24 variables) HE
2. Educational Structure (7 variables) KO
3. Inhabitants' Disposable Monetary Income (7 variables) HR
4. Size and Stage in Life of Households (15 variables) TE
5. Households' Disposable Monetary Income (7 variables) TR
6. Buildings and Dwellings (8 variables) RA
7. Workplace Structure (26 variables) TP
8. Main Type of Activity (9 variables) PT

Each data group has a two-letter code (for example 'HE' for the population structure) that appears in the names of all variables belonging to that data group.  Additionally, the database contains the following identification data: postal code, name of the postal code area, coordinates (X and Y) and municipality code.

Paavo data is explored in more detail below.


### 2.2.1 Downloading Paavo data

Paavo data is available for download through a couple of different methods.  For this study the 'Graphical User Interface' method was used, downloading the data from here: http://pxnet2.stat.fi/PXWeb/pxweb/en/Postinumeroalueittainen_avoin_tieto/?rxid=4e21d676-5dd1-4575-ab30-d35b741089d4


### 2.2.2 Exploring Paavo Data content

The following gives a brief overview of what kind of data Paavo actually provides.  It shows all the available variables as rows, and the variable values for three locations:

1. The whole of Finland (country totals)
2. Postal code area 00100, which is downtown Helsinki, Finland's capital
3. Postal code area 89840, which is a rather rural area in eastern Finland (Suomussalmi)

In [4]:
import os
import pandas as pd

# First load the data (tab-separated file, ISO-8859-1 encoded)

PAAVO_FILENAME = 'paavo_9_koko_en_tab.csv'
if not os.path.isfile(PAAVO_FILENAME):
    # Stop here: the cells below cannot run without the data
    raise FileNotFoundError("Did not find data file: " + PAAVO_FILENAME)

paavo_df = pd.read_csv(PAAVO_FILENAME, sep='\t', encoding='iso-8859-1')
print("Loaded Paavo data.\nFound {} rows and {} columns of data.".format(paavo_df.shape[0], paavo_df.shape[1]))

# Create a subset dataframe to inspect data.  In the transposed dataframe:
#    - Column 0 is for the whole of Finland,
#    - Column 1 is for postal code 00100 (center of Finland's capital)
#    - Column 2600 is for postal code 89840 (very rural area)
#
paavo_fin_df = paavo_df.T[[0, 1, 2600]]
paavo_fin_df.columns = ["Whole Finland", paavo_df.iloc[1,0], paavo_df.iloc[2600,0]]


Loaded Paavo data.
Found 3027 rows and 105 columns of data.


#### 2.2.2.1 Population Structure (24 variables) HE

In [5]:
paavo_fin_df[4:28]

| Variable | Whole Finland | 00100 Helsinki Keskusta - Etu-Töölö (Helsinki | 89840 Ylä-Vuokki (Suomussalmi ) |
|---|---|---|---|
| Inhabitants, total, 2017 (HE) | 5513130 | 18284 | 57 |
| Females, 2017 (HE) | 2793999 | 9613 | 20 |
| Males, 2017 (HE) | 2719131 | 8671 | 37 |
| Average age of inhabitants, 2017 (HE) | 42 | 41 | 63 |
| 0-2 years, 2017 (HE) | 160297 | 434 | 0 |
| 3-6 years, 2017 (HE) | 240994 | 521 | 0 |
| 7-12 years, 2017 (HE) | 369950 | 711 | 0 |
| 13-15 years, 2017 (HE) | 177163 | 274 | 0 |
| 16-17 years, 2017 (HE) | 117857 | 185 | 0 |
| 18-19 years, 2017 (HE) | 120218 | 264 | 1 |


#### 2.2.2.2 Educational Structure (7 variables) KO

In [6]:
paavo_fin_df[28:35]

| Variable | Whole Finland | 00100 Helsinki Keskusta - Etu-Töölö (Helsinki | 89840 Ylä-Vuokki (Suomussalmi ) |
|---|---|---|---|
| Aged 18 or over, total, 2017 (KO) | 4446869 | 16159 | 57 |
| Basic level studies, 2017 (KO) | 1112261 | 1996 | 30 |
| With education, total, 2017 (KO) | 3334608 | 14163 | 27 |
| Matriculation examination, 2017 (KO) | 303230 | 2618 | 1 |
| Vocational diploma, 2017 (KO) | 2035528 | 2942 | 24 |
| Academic degree - Lower level university degree, 2017 (KO) | 518969 | 2899 | 2 |
| Academic degree - Higher level university degree, 2017 (KO) | 476881 | 5704 | 0 |


#### 2.2.2.3 Inhabitants' Disposable Monetary Income (7 variables) HR


In [7]:
paavo_fin_df[35:42]

| Variable | Whole Finland | 00100 Helsinki Keskusta - Etu-Töölö (Helsinki | 89840 Ylä-Vuokki (Suomussalmi ) |
|---|---|---|---|
| Aged 18 or over, total, 2016 (HR) | 4431392 | 15935 | 60 |
| Average income of inhabitants, 2016 (HR) | 23812 | 38985 | 16166 |
| Median income of inhabitants, 2016 (HR) | 20925 | 26642 | 14939 |
| Inhabintants belonging to the lowest income category, 2016 (HR) | 886431 | 2856 | 26 |
| Inhabitants belonging to the middle income category, 2016 (HR) | 2658687 | 6668 | 31 |
| Inhabintants belonging to the highest income category, 2016 (HR) | 886274 | 6411 | 3 |
| Accumulated purchasing power of inhabitants, 2016 (HR) | 105520349469 | 621218859 | 969978 |


#### 2.2.2.4 Size and Stage in Life of Households (15 variables) TE


In [9]:
paavo_fin_df[42:57]

| Variable | Whole Finland | 00100 Helsinki Keskusta - Etu-Töölö (Helsinki | 89840 Ylä-Vuokki (Suomussalmi ) |
|---|---|---|---|
| Households, total, 2017 (TE) | 2680077.0 | 10205.0 | 34.0 |
| Average size of households, 2017 (TE) | 2.0 | 1.8 | 1.7 |
| Occupancy rate, 2017 (TE) | 40.5 | 38.6 | 56.4 |
| Young single persons, 2017 (TE) | 291052.0 | 2101.0 | 1.0 |
| Young couples without children, 2017 (TE) | 115168.0 | 861.0 | 0.0 |
| Households with children, 2017 (TE) | 570112.0 | 1326.0 | 0.0 |
| Households with small children, 2017 (TE) | 142781.0 | 400.0 | 0.0 |
| Households with children under school age, 2017 (TE) | 278849.0 | 715.0 | 0.0 |
| Households with school-age children, 2017 (TE) | 263490.0 | 541.0 | 0.0 |
| Households with teenagers, 2017 (TE) | 221106.0 | 373.0 | 0.0 |


#### 2.2.2.5 Households' Disposable Monetary Income (7 variables) TR


In [10]:
paavo_fin_df[57:64]

| Variable | Whole Finland | 00100 Helsinki Keskusta - Etu-Töölö (Helsinki | 89840 Ylä-Vuokki (Suomussalmi ) |
|---|---|---|---|
| Households, total, 2016 (TR) | 2654657 | 10042 | 36 |
| Average income of households, 2016 (TR) | 39270 | 61679 | 26975 |
| Median income of households, 2016 (TR) | 31824 | 38895 | 23598 |
| Households belonging to the lowest income category, 2016 (TR) | 677223 | 1697 | 13 |
| Households belonging to the middle income category, 2016 (TR) | 1500917 | 4123 | 22 |
| Households belonging to the highest income category, 2016 (TR) | 476517 | 4222 | 1 |
| Accumulated purchasing power of households, 2016 (TR) | 104247634221 | 619383515 | 971110 |


#### 2.2.2.6 Buildings and Dwellings (8 variables) RA


In [11]:
paavo_fin_df[64:72]

| Variable | Whole Finland | 00100 Helsinki Keskusta - Etu-Töölö (Helsinki | 89840 Ylä-Vuokki (Suomussalmi ) |
|---|---|---|---|
| Free-time residences, 2017 (RA) | 507200.0 | 0.0 | 103.0 |
| Buildings, total, 2017 (RA) | 1523196.0 | 634.0 | 90.0 |
| Other buildings, 2017 (RA) | 228770.0 | 326.0 | 14.0 |
| Residential buildings, 2017 (RA) | 1294426.0 | 308.0 | 76.0 |
| Dwellings, 2017 (RA) | 2946814.0 | 11884.0 | 48.0 |
| Average floor area, 2017 (RA) | 80.1 | 65.9 | 97.7 |
| Dwellings in small houses, 2017 (RA) | 1568029.0 | 2.0 | 48.0 |
| Dwellings in blocks of flats, 2017 (RA) | 1378785.0 | 11882.0 | 0.0 |


#### 2.2.2.7 Workplace Structure (26 variables) TP


In [12]:
paavo_fin_df[72:98]

| Variable | Whole Finland | 00100 Helsinki Keskusta - Etu-Töölö (Helsinki | 89840 Ylä-Vuokki (Suomussalmi ) |
|---|---|---|---|
| Workplaces, 2016 (TP) | 2094313 | 48470 | 16 |
| Primary production, 2016 (TP) | 56104 | 104 | 0 |
| Processing, 2016 (TP) | 461153 | 1805 | 0 |
| Services, 2016 (TP) | 1576997 | 46560 | 16 |
| A Agriculture, forestry and fishing, 2016 (TP) | 56104 | 104 | 0 |
| B Mining and quarrying, 2016 (TP) | 5283 | 0 | 0 |
| C Manufacturing, 2016 (TP) | 283209 | 752 | 0 |
| D Electricity, gas, steam and air conditioning supply, 2016 (TP) | 11714 | 554 | 0 |
| E Water supply; sewerage, waste management and remediation activities, 2016 (TP) | 10703 | 1 | 0 |
| F Construction, 2016 (TP) | 150244 | 498 | 0 |


#### 2.2.2.8 Main Type of Activity (9 variables) PT


In [13]:
paavo_fin_df[98:]

| Variable | Whole Finland | 00100 Helsinki Keskusta - Etu-Töölö (Helsinki | 89840 Ylä-Vuokki (Suomussalmi ) |
|---|---|---|---|
| Inhabitants, 2016 (PT) | 5503297 | 18035 | 61 |
| Employed, 2016 (PT) | 2275679 | 10032 | 13 |
| Unemployed, 2016 (PT) | 355837 | 856 | 6 |
| Children aged 0 to 14, 2016 (PT) | 894178 | 1812 | 0 |
| Students, 2016 (PT) | 407905 | 1198 | 1 |
| Pensioners, 2016 (PT) | 1389830 | 3326 | 40 |
| Others, 2016 (PT) | 179868 | 811 | 1 |


### 2.2.3 Notes on Paavo dataset

Note the following about Paavo data:

1. **Time scale:** Some variable values are based on year 2016 data, some on year 2017 data.  This is not a major issue for this study, but it may affect the results a little.
2. Postal code **location information** is not given as latitude and longitude, but as X and Y values.  This is effectively an alternative coordinate system, and the values need to be converted to latitude and longitude before we can combine this data with Foursquare data.
3. The coordinate values (X and Y) point to the **center of the postal code area**.  In some cases, this center is not where one would intuitively expect.  For example, the X and Y of a city-center postal code area may actually fall on a lake or some other unexpected location, due to the shape of the postal code area.  This is somewhat compensated for by the radius value when looking for Foursquare venue data for that postal code area.
4. **Small value filtering:** Some postal code areas are rather small in population or other metrics, and if the total is less than 30, Paavo data does not contain the details for those areas.  This is something that we have to take care of when preparing the data.  After removing postal codes with too few samples, we are left with 2108 postal code areas.
5. **This study focuses on the top 20 cities in Finland plus one smaller city as a bonus: Savonlinna**, a mid-sized city that is busy with holidaymakers in summer but quieter at other times of the year.  After selecting only those postal code areas that belong to the chosen cities, we have 677 postal code areas to work with.

## 2.3 Notes on combining the Foursquare and Paavo data

---

1. In this study, Paavo data is the primary data, which is then used to get the respective Foursquare data.  Venue data is used to augment the postal code area statistics.  The two datasets are joined together by location data.

2. Only venue category data is combined with Paavo data.  That is, for each postal code area we add information about how many venues of each type there are, for example 2 cafes and 5 bus stops.

3. Time span of data: for Paavo statistics, we know the year for each variable, but for Foursquare venue data we cannot be certain how it matches the statistics.  For example, was the venue present in the location at the time the statistics were collected?  Since we cannot know for sure, we need to assume that in general the time spans match _well enough_, without being able to specify what is well enough.

4. To make it easier to focus on our research question, we create additional summaries of cafe venues and restaurant venues.
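
The counting and summarising steps in notes 2 and 4 can be sketched with pandas as below.  This is a minimal sketch, not the notebook's actual code: the toy rows, the `Inhabitants` and `Cafes total` column names and the join on the `PC` column are assumptions made for illustration only.

```python
import pandas as pd

# Toy venue rows in the same layout as the venue CSV (values are illustrative only)
venues = pd.DataFrame({
    "PC": ["00100", "00100", "00100", "00120", "00120"],
    "Venue Category": ["Coffee Shop", "Bus Stop", "Coffee Shop",
                       "Pizza Place", "Cafeteria"],
})

# Toy stand-in for the Paavo statistics (hypothetical column, illustrative values)
paavo = pd.DataFrame({
    "PC": ["00100", "00120", "00130"],
    "Inhabitants": [18000, 7000, 1500],
})

# One row per postal code, one column per venue category, values are counts
counts = pd.crosstab(venues["PC"], venues["Venue Category"])

# Extra summary column: all cafe-related categories added together
cafe_cols = [c for c in counts.columns
             if "coffee" in c.lower() or "cafe" in c.lower()]
counts["Cafes total"] = counts[cafe_cols].sum(axis=1)

# Paavo stays the primary table; areas without any venues get zero counts
merged = paavo.merge(counts.reset_index(), on="PC", how="left").fillna(0)
print(merged)
```

A left join keeps every Paavo postal code area in the result even when no venues were found for it.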


# 3. Methodology

---

<div style="color:blue">

_Methodology section represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, and what machine learnings were used and why._
</div>

-----

## 3.1 Exploratory data analysis


The raw data was already introduced in section 2.  This section concentrates on exploring how the combined data looks and what could be done with it.

- correlations
- linear regression prediction (using mostly correlating factors as input?)
- clustering...

### Analysing venue data quality: can it be trusted, is it representative enough?

By default we should be able to trust Paavo data, as it is collected by official authorities.  Foursquare data, however, is contributed voluntarily by users, so we analyze it briefly to see whether it is up to this task.

- for each postal code area count how many venues it shares with another postal code area
    - overlap is a good sign, then we do not miss any venues because of the FS_LIMIT and Radius values.
    - display a graph / statistics how much overlap there is
- for each postal code area, calculate the average distance of its venues from its center.
    - if close to radius, then we might have a problem
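
Both checks listed above can be sketched with the standard library alone.  This is a rough illustration under stated assumptions: the function names are hypothetical, distances are great-circle (haversine) distances on a spherical Earth, and venues are identified by their venue id.

```python
import math
from collections import defaultdict

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def shared_venue_counts(rows):
    """rows: (postal_code, venue_id) pairs.

    Returns {postal_code: number of its venues also listed under another code}.
    """
    codes_by_venue = defaultdict(set)
    for pc, venue_id in rows:
        codes_by_venue[venue_id].add(pc)
    shared = defaultdict(int)
    for codes in codes_by_venue.values():
        for pc in codes:
            shared[pc] += 1 if len(codes) > 1 else 0
    return dict(shared)


def mean_distance_m(center, venue_points):
    """Average distance in metres from the area center to its venues."""
    distances = [haversine_m(center[0], center[1], lat, lon)
                 for lat, lon in venue_points]
    return sum(distances) / len(distances)


# Example with the 00100 center and two venues from the sample rows in section 2.1.2
center = (60.172207, 24.92929)
venue_points = [(60.168481, 24.947506), (60.1754, 24.92519)]
print(round(mean_distance_m(center, venue_points)), "metres on average")
```

Comparing the average distance with the search radius of the same postal code area then shows how close the venues lie to the edge of the search circle.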


### Clustering (postal code areas and cities)

- find best k, visualize showing how the distribution of cluster sizes vary over k
- cluster based on combined Paavo and venue data on postal code level
- cluster cities directly (city level summaries)


### Understanding clusters

Postal code area clusters:
- Most common venue categories in clusters (sums on cluster level)
    - does not care if one large postal code area dominates
- Most common venue categories in clusters (average over postal code area proportions)
    - each postal code area has equal effect.
- typical Paavo data for each cluster:
    - average population, surface area etc.
- find out any linear correlations within clusters

City clusters:
- same?
- find out any linear correlations within clusters



### Summary of Paavo and Venue data on city level

- just grouping data from postal code level to city levels
- include clusters on city level (both direct city clusters and distribution of postal code area clusters within city)
- To continue with validating Venue data, check if venue total/cafe/restaurant counts etc correlate with any Paavo data like area size, population size etc or other venues
    - visual inspection?
    - calculate correlations
    - try linear regression prediction
    - Finland level, city cluster level, postal code area cluster level
- find out any linear correlations within cities
- Identify top 20 cities, see how averages compare, visualizations on selected metrics.
    - population size, surface area, income average, total venues?, total cafes, total restaurants, selected jobs on area
    - pie chart of population among top 20 - top 50 cities
    - proportions of cafes / restaurants to population, net income, jobs etc.
- citywise cluster percentages
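
The "calculate correlations" and "try linear regression prediction" steps in the list above could look roughly like this.  It is a sketch with made-up numbers, using a single predictor and ordinary least squares; the function names are hypothetical and the real analysis may well use several Paavo variables and a library implementation instead.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Illustrative numbers only: population and cafe count for five areas
population = [1000, 2000, 3000, 4000, 5000]
cafes = [1, 2, 2, 4, 6]

r = pearson_r(population, cafes)
slope, intercept = fit_line(population, cafes)

# Residual = actual cafes minus predicted cafes; the most negative residual
# marks the area with the fewest cafes relative to the prediction
residuals = [y - (slope * x + intercept) for x, y in zip(population, cafes)]
most_underserved = residuals.index(min(residuals))
print(round(r, 3), most_underserved)
```

An area with a clearly negative residual has fewer cafes than the fitted line suggests, which is the "should have X cafes, but has only Y" situation described in the introduction.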


### Visual inspection of clusters on map (in cities + over Finland)




## 3.2 Statistical testing

## 3.3 Machine learning techniques

# 4. Results

---

_To be filled in later in the second part of this assignment..._

# 5. Discussion

---

_To be filled in later in the second part of this assignment..._

# 6. Conclusion

---

_To be filled in later in the second part of this assignment..._