# 03 - Interactive Viz

## Assignment

1. Go to the Eurostat website and try to find a dataset that includes the European unemployment rates at a recent date.

   Use this data to build a Choropleth map which shows the unemployment rate in Europe at a country level. Think about the colors you use, how you decided to split the intervals into data classes or which interactions you could add in order to make the visualization intuitive and expressive. Compare Switzerland's unemployment rate to that of the rest of Europe.

2. Go to the amstat website to find a dataset that includes the unemployment rates in Switzerland at a recent date.

   HINT: Go to the details tab to find the raw data you need. If you do not speak French, German or Italian, think of using free translation services to navigate your way through.

   Use this data to build another Choropleth map, this time showing the unemployment rate at the level of Swiss cantons. Again, try to make the map as expressive as possible, and comment on the trends you observe.

   The Swiss Confederation defines the rates you have just plotted as the number of people looking for a job divided by the size of the active population (scaled by 100). This is surely a valid choice, but as we discussed one could argue for a different categorization.

   Copy the map you have just created, but this time don't count in your statistics people who already have a job and are looking for a new one. How do your observations change? You can repeat this with different choices of categories to see how selecting different metrics can lead to different interpretations of the same data.

3. Use the amstat website again to find a dataset that includes the unemployment rates in Switzerland at a recent date, this time making a distinction between Swiss and foreign workers.

   The State Secretariat for Economic Affairs (SECO) releases a monthly report on the state of the employment market. In the latest report (September 2017), it is noted that there is a discrepancy between the unemployment rates for foreign (5.1%) and Swiss (2.2%) workers.

   Show the difference in unemployment rates between the two categories in each canton on a Choropleth map (hint: the easy way is to show two separate maps, but can you think of something better?). Where are the differences most visible? Why do you think that is?

   Now let's refine the analysis by adding the differences between age groups. As you may have guessed, it is nearly impossible to plot so many variables on a map. Make a bar plot, which is a better suited visualization tool for this type of multivariate data.

4. BONUS: using the map you have just built, and the geographical information contained in it, could you give a rough estimate of the difference in unemployment rates between the areas divided by the Röstigraben?

## 1. European Unemployment Rate
---
<div class="alert alert-block alert-info">
Go to the eurostat website and try to find a dataset that includes the european unemployment rates at a recent date. <br><br>

Use this data to build a Choropleth map which shows the unemployment rate in Europe at a country level. Think about the colors you use, how you decided to split the intervals into data classes or which interactions you could add in order to make the visualization intuitive and expressive. Compare Switzerland's unemployment rate to that of the rest of Europe.
</div>

First, we need to import the libraries used, which are the usual ones, such as *pandas* and *numpy*, but also *folium* for generating the required maps.

In [1]:
import pandas as pd
import numpy as np
import jenkspy
import branca
import folium
import os
import json
import copy
from IPython.display import IFrame
%matplotlib inline
import matplotlib.pyplot as plt

Finding the data on the *eurostat* website is straightforward, and metadata files are also available that define exactly what *employed* and *unemployed* persons are, along with the other terms used in the datasets. <br>

At the same time, the amount of available information makes it hard to decide which dataset to use. This is the case for the first exercise, where the following branches are available under the Employment and Unemployment domain <a href="http://ec.europa.eu/eurostat/cache/metadata/EN/employ_esms.htm">[source]</a>:
- 'LFS main indicators' consists of a selection of the most important monthly, quarterly and annual  labour market indicators, most of them based on EU-LFS.
- 'LFS series - detailed quarterly survey results' and 'LFS series - detailed annual survey results' is a more comprehensive selection of data from the EU-LFS.
- 'LFS series - specific topics' report annual regional data (NUTS III) and annual data on households (both households demographics and labour market results by household type).
- 'LFS series - adhoc modules' report annual results for some EU-LFS adhoc modules. Other adhoc module results are published in other domains where they better fit (e.g. in education statistics or health statistics).
---

Looking more carefully at the four branches and the data they contain, we see that only the first two could be helpful in our case. Choosing between them is not obvious, but we decided to continue with the data in the *LFS main indicators* branch. The main reason is that the branch *LFS series - detailed quarterly survey results* contains the raw data collected every three months from surveys conducted in each country, whereas the data in *LFS main indicators* is computed from that raw survey data and then adjusted and enriched, as presented <a href="http://ec.europa.eu/eurostat/cache/metadata/de/une_esms.htm#stat_pres1496733880223">here</a>. We therefore consider the data in *LFS main indicators* to be more carefully curated and probably closer to reality than the raw survey results. We downloaded the dataset with the unemployment rates by sex and age and import it into a pandas DataFrame below:

In [2]:
UNEMPLOYMENT_MONTHLY_FILE = os.path.join('.', 'data', 'une_rt_m.tsv')
UNEMPLOYMENT_QUARTERLY_FILE = os.path.join('.', 'data', 'une_rt_q.tsv')
UNEMPLOYMENT_YEARLY_FILE = os.path.join('.', 'data', 'une_rt_a.tsv')


unemployed = pd.read_csv(UNEMPLOYMENT_MONTHLY_FILE, sep='\t')
unemployed.head()

Unnamed: 0,"s_adj,age,unit,sex,geo\time",2017M09,2017M08,2017M07,2017M06,2017M05,2017M04,2017M03,2017M02,2017M01,...,1983M10,1983M09,1983M08,1983M07,1983M06,1983M05,1983M04,1983M03,1983M02,1983M01
0,"NSA,TOTAL,PC_ACT,F,AT",5.1,5.2,4.4,4.9,5.1,4.7,5.4,4.9,5.3,...,:,:,:,:,:,:,:,:,:,:
1,"NSA,TOTAL,PC_ACT,F,BE",7.2,7.7,7.6,6.8,6.9,7.4,8.0,8.2,7.8 b,...,:,:,:,:,:,:,:,:,:,:
2,"NSA,TOTAL,PC_ACT,F,BG",5.3,5.3,5.3,5.3,5.6,6.2,6.8,7.1,7.0,...,:,:,:,:,:,:,:,:,:,:
3,"NSA,TOTAL,PC_ACT,F,CY",10.7,12.2,12.4,11.4,10.7,11.4,13.1,13.9,14.0,...,:,:,:,:,:,:,:,:,:,:
4,"NSA,TOTAL,PC_ACT,F,CZ",3.5,3.5,3.4,3.5,3.5,4.2,4.4,4.3,4.2,...,:,:,:,:,:,:,:,:,:,:


Looking at the resulting DataFrame, we can see that it contains monthly data going back to January 1983, with each month represented as a column. Because the exercise explicitly asks for unemployment rates at a recent date, we keep only the last 12 months, as below:

In [3]:
columns_to_drop = list(range(13, len(unemployed.columns))) 
# drop all the columns with index bigger or equal to 13, i.e. keep the first 13 columns
unemployed = unemployed.drop(unemployed.columns[columns_to_drop], axis=1)
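The same selection can be sketched with positional slicing. This is a minimal illustration, assuming (as in the table above) that the first column holds the composite key and that the months are ordered from most recent to oldest. The toy frame and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical toy frame: one key column followed by 20 monthly columns,
# ordered from most recent to oldest like the Eurostat export.
columns = ["key"] + ["month_{}".format(i) for i in range(20)]
toy = pd.DataFrame([["AT"] + [float(i) for i in range(20)]], columns=columns)

# Keep the key column plus the 12 most recent months (the first 13 columns).
recent = toy.iloc[:, :13]
print(recent.columns.tolist())
```

Slicing by position avoids building the list of columns to drop, at the cost of relying on the column order of the export.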

In [4]:
unemployed.head()

Unnamed: 0,"s_adj,age,unit,sex,geo\time",2017M09,2017M08,2017M07,2017M06,2017M05,2017M04,2017M03,2017M02,2017M01,2016M12,2016M11,2016M10
0,"NSA,TOTAL,PC_ACT,F,AT",5.1,5.2,4.4,4.9,5.1,4.7,5.4,4.9,5.3,4.9,5.5,5.4
1,"NSA,TOTAL,PC_ACT,F,BE",7.2,7.7,7.6,6.8,6.9,7.4,8.0,8.2,7.8 b,6.8,6.5,6.6
2,"NSA,TOTAL,PC_ACT,F,BG",5.3,5.3,5.3,5.3,5.6,6.2,6.8,7.1,7.0,6.6,6.5,6.4
3,"NSA,TOTAL,PC_ACT,F,CY",10.7,12.2,12.4,11.4,10.7,11.4,13.1,13.9,14.0,14.2,14.4,12.3
4,"NSA,TOTAL,PC_ACT,F,CZ",3.5,3.5,3.4,3.5,3.5,4.2,4.4,4.3,4.2,4.2,4.2,4.3


Furthermore, if we look at the first column, we realize it is composed of multiple fields, which are described below:

1. s_adj: seasonal adjustment, which has the following possible values:
    * NSA = Unadjusted data (i.e. neither seasonally adjusted nor calendar adjusted data).
    * SA = Seasonally adjusted data, not calendar adjusted data.
    * TC = Trend cycle data.
    
2. age: age of the people in the category, which has the following possible values:
    * TOTAL = Everyone is included in the category.
    * Y_LT25 = Less than 25 years.
    * Y25-74 = From 25 to 74 years.
    
3. unit: unit of measure, which has the following possible values:
    * THS_PER = Thousand persons
    * PC_ACT = Percentage of active population
 
4. sex: sex of the people in the category, which has the following possible values:
    * T = Total (everyone is included)
    * M = Males
    * F = Females
    
5. geo: the country or area in the category. Possible values are the country codes of the European countries, as well as codes for aggregates such as the European Union or the Euro Area.
---
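The composite key described above can also be split into one column per field, which makes row selection explicit instead of relying on a string prefix. The sketch below uses a few made-up rows in the same format as the dataset; the variable names are illustrative only.

```python
import pandas as pd

KEY = "s_adj,age,unit,sex,geo\\time"

# Hypothetical rows in the same format as the first column of the dataset.
toy = pd.DataFrame({KEY: ["NSA,TOTAL,PC_ACT,F,AT",
                          "SA,TOTAL,PC_ACT,T,BE",
                          "SA,Y_LT25,PC_ACT,T,BE",
                          "SA,TOTAL,THS_PER,T,BE"]})

# Split the composite key into one column per field.
fields = toy[KEY].str.split(",", expand=True)
fields.columns = ["s_adj", "age", "unit", "sex", "geo"]

# Seasonally adjusted, all ages, percentage of active population, both sexes.
mask = ((fields["s_adj"] == "SA") & (fields["age"] == "TOTAL")
        & (fields["unit"] == "PC_ACT") & (fields["sex"] == "T"))
selected_countries = fields.loc[mask, "geo"].tolist()
print(selected_countries)
```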

We are interested only in the seasonally adjusted data, without distinguishing by age or sex. We also want the unemployment *rate*, not the total number of unemployed people. Thus, we keep only the rows whose first column value starts with the string *SA,TOTAL,PC_ACT,T,*. From the first column we keep only the country code, because all the other fields are then identical for every row. The resulting table is constructed below:

In [5]:
select_rows = unemployed["s_adj,age,unit,sex,geo\\time"].map(lambda x: x.startswith("SA,TOTAL,PC_ACT,T,"))
unemployed = unemployed[select_rows] # select only specified rows

unemployed.rename(columns={"s_adj,age,unit,sex,geo\\time" : "country_code"}, inplace=True) # rename first column
unemployed["country_code"] = unemployed["country_code"].map(lambda x: x[18:]) 
# delete the first 18 characters of the first column, i.e. keep only the country code

unemployed = unemployed.reset_index(drop=True) # reset index from 0

In [6]:
unemployed.head(10)

Unnamed: 0,country_code,2017M09,2017M08,2017M07,2017M06,2017M05,2017M04,2017M03,2017M02,2017M01,2016M12,2016M11,2016M10
0,AT,5.6,5.5,5.4,5.3,5.4,5.6,5.7,5.8,5.7,5.7,5.8,5.9
1,BE,7.1,7.3,7.3,7.2,7.3,7.4,7.6,7.7,7.6 b,7.2,7.2,7.2
2,BG,6.1,6.2,6.1,6.1,6.1,6.2,6.4,6.6,6.6,6.7,6.8,7.0
3,CY,10.3,10.6,10.7,10.9,11.3,11.6,12.1,12.5,12.7,13.2,13.4,13.1
4,CZ,2.7,2.8,2.8,2.9,3.0,3.3,3.2,3.3,3.3,3.5,3.7,3.7
5,DE,3.6,3.6,3.7,3.7,3.8,3.8,3.9,3.9,3.9,3.9,3.9,4.0
6,DK,5.7,5.7,5.8,5.7,5.7,5.7,5.9,6.1,6.1,6.1,6.3,6.5
7,EA,8.9,9.0,9.0,9.1,9.2,9.2,9.4,9.5,9.6,9.6,9.7,9.8
8,EA18,8.9,9.0,9.0,9.1,9.2,9.2,9.4,9.5,9.6,9.7,9.8,9.8
9,EA19,8.9,9.0,9.0,9.1,9.2,9.2,9.4,9.5,9.6,9.6,9.7,9.8


Let us look at the data types of each column:

In [7]:
unemployed.dtypes

country_code    object
2017M09         object
2017M08         object
2017M07         object
2017M06         object
2017M05         object
2017M04         object
2017M03         object
2017M02         object
2017M01         object
2016M12         object
2016M11         object
2016M10         object
dtype: object

We want to transform the data to numeric in all the columns except country_code, after replacing the missing data, marked with ":", with NaN value. Also, we change the value of unemployment rate during January 2017 in Belgium from *7.6 b* to *7.6*.

In [8]:
unemployed = unemployed.applymap(lambda x: np.nan if x == ": " else x) # replace the missing-value marker ": " with NaN
unemployed.at[1, '2017M01 '] = 7.6 # Belgium, January 2017: replace the flagged value "7.6 b" (raw column names end with a space)

for column in unemployed.columns[1:]:
    unemployed[column] = pd.to_numeric(unemployed[column])
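The cell above fixes the single flagged value by hand. A more general cleaning step could look like the sketch below, which assumes that a flag always follows the number after a space (as in `7.6 b`) and that cells without any number, such as `: `, are missing. The helper name `to_numeric_rates` is hypothetical.

```python
import pandas as pd

def to_numeric_rates(series):
    """Convert raw cells to floats; cells without a number become NaN."""
    # Capture the leading number and ignore anything after it (for example " b").
    numbers = series.astype(str).str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(numbers, errors="coerce")

# Hypothetical raw cells in the same style as the dataset.
raw = pd.Series(["7.6 b", ": ", "5.3 ", "12.7"])
cleaned = to_numeric_rates(raw)
print(cleaned.tolist())
```

This drops the flags silently, so it is only appropriate when, as here, the flagged values are still meant to be used as ordinary observations.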

In [9]:
unemployed.head(10)

Unnamed: 0,country_code,2017M09,2017M08,2017M07,2017M06,2017M05,2017M04,2017M03,2017M02,2017M01,2016M12,2016M11,2016M10
0,AT,5.6,5.5,5.4,5.3,5.4,5.6,5.7,5.8,5.7,5.7,5.8,5.9
1,BE,7.1,7.3,7.3,7.2,7.3,7.4,7.6,7.7,7.6,7.2,7.2,7.2
2,BG,6.1,6.2,6.1,6.1,6.1,6.2,6.4,6.6,6.6,6.7,6.8,7.0
3,CY,10.3,10.6,10.7,10.9,11.3,11.6,12.1,12.5,12.7,13.2,13.4,13.1
4,CZ,2.7,2.8,2.8,2.9,3.0,3.3,3.2,3.3,3.3,3.5,3.7,3.7
5,DE,3.6,3.6,3.7,3.7,3.8,3.8,3.9,3.9,3.9,3.9,3.9,4.0
6,DK,5.7,5.7,5.8,5.7,5.7,5.7,5.9,6.1,6.1,6.1,6.3,6.5
7,EA,8.9,9.0,9.0,9.1,9.2,9.2,9.4,9.5,9.6,9.6,9.7,9.8
8,EA18,8.9,9.0,9.0,9.1,9.2,9.2,9.4,9.5,9.6,9.7,9.8,9.8
9,EA19,8.9,9.0,9.0,9.1,9.2,9.2,9.4,9.5,9.6,9.6,9.7,9.8


We also set *country_code* as the index and turn the column names into the integers 1-12, corresponding to the month number:

In [10]:
unemployed = unemployed.set_index('country_code')

renamed_columns_dict = {old : int(old[5:-1]) for old in unemployed.columns}
unemployed = unemployed.rename(columns=renamed_columns_dict)
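Renaming the columns to bare month numbers discards the year, which later has to be inferred from the month. As an alternative sketch, the raw labels could be parsed into `(year, month)` pairs. The helper `parse_month_label` is hypothetical, and the labels are assumed to keep the trailing space seen in the raw file.

```python
def parse_month_label(label):
    """Turn a raw label such as '2017M09 ' into a (year, month) pair."""
    year, month = label.strip().split("M")
    return int(year), int(month)

labels = ["2017M09 ", "2017M01 ", "2016M10 "]
parsed = [parse_month_label(label) for label in labels]
print(parsed)  # [(2017, 9), (2017, 1), (2016, 10)]
```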

In [11]:
unemployed.head()

country_code,9,8,7,6,5,4,3,2,1,12,11,10
AT,5.6,5.5,5.4,5.3,5.4,5.6,5.7,5.8,5.7,5.7,5.8,5.9
BE,7.1,7.3,7.3,7.2,7.3,7.4,7.6,7.7,7.6,7.2,7.2,7.2
BG,6.1,6.2,6.1,6.1,6.1,6.2,6.4,6.6,6.6,6.7,6.8,7.0
CY,10.3,10.6,10.7,10.9,11.3,11.6,12.1,12.5,12.7,13.2,13.4,13.1
CZ,2.7,2.8,2.8,2.9,3.0,3.3,3.2,3.3,3.3,3.5,3.7,3.7


We must also explain what these numbers represent. As presented in the metadata quoted above, the table contains the monthly unemployment rate, calculated as the number of unemployed people as a percentage of the labour force. The *labour force* is the total number of people employed and unemployed.

We must also define the concepts of employed and unemployed persons below:

* **Unemployed persons** are all persons 15 to 74 years of age (16 to 74 years in ES, IT and the UK) who were not employed during the reference week, had actively sought work during the past four weeks and were ready to begin working immediately or within two weeks. 

The duration of unemployment is defined as the duration of a search for a job or as the length of the period since the last job was held (if this period is shorter than the duration of search for a job).

* **Employed persons** are all persons who worked at least one hour for pay or profit during the reference week or were temporarily absent from such work. For the unemployment rate, only persons from 15 to 74 years of age are used.
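The rate definition above can be written as a short function. The head counts in the example are invented purely to illustrate the arithmetic.

```python
def unemployment_rate(unemployed_persons, employed_persons):
    """Unemployed persons as a percentage of the labour force."""
    labour_force = employed_persons + unemployed_persons
    return 100.0 * unemployed_persons / labour_force

# Hypothetical head counts: 50,000 unemployed and 950,000 employed persons.
rate = unemployment_rate(unemployed_persons=50000, employed_persons=950000)
print(rate)  # 5.0
```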

We can now proceed to constructing the unemployment map for the European countries. The first inconvenience we encounter is that some of the countries in the topojson are not in the European Union and therefore do not appear in the *unemployed* table. If we try to create the Choropleth map with the current DataFrame, the code will throw an error. In order to solve that problem, we also need to account for the countries which are not in the DataFrame, treating them as having only NaN values for the unemployment rate.

In [12]:
TOPOGRAPHY_FILE = os.path.join(".", "topojson", "europe.topojson.json")

with open(TOPOGRAPHY_FILE) as topography_file: # close the file once it is parsed
    topojson = json.load(topography_file)
country_codes_topojson = [item['id'] for item in topojson["objects"]["europe"]["geometries"]]
# country_codes_topojson represents the codes of the country in the topojson

set(unemployed.index).difference(set(country_codes_topojson))

{'EA', 'EA18', 'EA19', 'EL', 'EU25', 'EU27', 'EU28', 'JP', 'UK', 'US'}

As we can see in the previous output, some values appear in the DataFrame but not in the topojson simply because they are not European country codes: *EA, EA18, EA19, EU25, EU27, EU28, JP, US*. At the same time, two European countries appear in the *unemployed* DataFrame but not in the topojson. On closer inspection, *EL* is the code for Greece, which is denoted by *GR* in the topojson, and *UK* is the code for the United Kingdom, which is denoted by *GB* in the topojson. We therefore change the country codes of those two countries in the DataFrame below:

In [13]:
index_list = unemployed.index.tolist()

greece_index = index_list.index('EL')
index_list[greece_index] = 'GR'

uk_index = index_list.index('UK')
index_list[uk_index] = 'GB'

unemployed.index = index_list
unemployed['country_code'] = unemployed.index

set(unemployed.index).difference(set(country_codes_topojson))

{'EA', 'EA18', 'EA19', 'EU25', 'EU27', 'EU28', 'JP', 'US'}
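The same relabeling can be expressed with a mapping passed to `DataFrame.rename`. This is a minimal sketch on a toy frame with invented values, not a replacement for the cell above.

```python
import pandas as pd

# Codes used by the dataset mapped to the codes used by the topojson.
EUROSTAT_TO_TOPOJSON = {"EL": "GR", "UK": "GB"}

# Hypothetical toy frame indexed by country code (values are illustrative).
toy = pd.DataFrame({7: [21.0, 4.3, 3.7]}, index=["EL", "UK", "DE"])

# Labels missing from the mapping (here DE) are left unchanged.
renamed = toy.rename(index=EUROSTAT_TO_TOPOJSON)
print(renamed.index.tolist())  # ['GR', 'GB', 'DE']
```

Keeping the mapping in one dictionary makes it easy to extend if further mismatches between the two code sets turn up.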

Next, to construct the Choropleth map, we need to choose a data classification method, i.e. group the unemployment rates into a small number of classes, such as 5 or 6, and then pick suitable colors to represent them. As explained <a href="http://gisgeography.com/choropleth-maps-data-classification/">here</a>, one of the best ways of creating classes is Jenks Natural Breaks. The goal of this method is to build classes with minimal variance inside each class and maximal variance between the class means. For example, in our case the outliers (such as Greece) need to be placed in a separate class, to show the big difference in unemployment compared to the other countries.
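To make the objective concrete, the sketch below finds natural breaks by brute force: it tries every way of cutting the sorted values into contiguous classes and keeps the split with the smallest sum of squared deviations from the class means. This is not how jenkspy is implemented and only scales to tiny inputs. The function names and the sample rates are invented for illustration.

```python
from itertools import combinations

def within_class_ssd(classes):
    """Sum of squared deviations of each value from its class mean."""
    total = 0.0
    for values in classes:
        mean = sum(values) / len(values)
        total += sum((v - mean) ** 2 for v in values)
    return total

def natural_breaks_brute_force(values, n_classes):
    """Try every contiguous split of the sorted values and keep the best one."""
    data = sorted(values)
    best_score, best_classes = None, None
    for cuts in combinations(range(1, len(data)), n_classes - 1):
        bounds = (0,) + cuts + (len(data),)
        classes = [data[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
        score = within_class_ssd(classes)
        if best_score is None or score < best_score:
            best_score, best_classes = score, classes
    return best_classes

# Hypothetical rates with two groups and one outlier.
rates = [3.0, 3.5, 4.0, 9.0, 9.5, 10.0, 21.0]
classes = natural_breaks_brute_force(rates, 3)
print(classes)  # [[3.0, 3.5, 4.0], [9.0, 9.5, 10.0], [21.0]]
```

The outlier ends up in a class of its own, which is the behaviour we want for a country such as Greece.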

---
To use Jenks Natural Breaks, we install a library called jenkspy with the command:
* `conda install -c conda-forge jenkspy`

Then, we look at the output of the algorithm for 5 and for 6 classes, using the data for July 2017. We choose July 2017 because there are no missing values for that month in the initial DataFrame:

In [14]:
jenks_breaks_5 = jenkspy.jenks_breaks(unemployed[7].dropna(), nb_class=5)
jenks_breaks_6 = jenkspy.jenks_breaks(unemployed[7].dropna(), nb_class=6)

jenks_breaks_5, jenks_breaks_6

([2.8, 4.3, 6.5, 9.0, 11.3, 21.0], [2.8, 4.3, 6.5, 9.0, 11.3, 16.9, 21.0])

We can see that the two lists differ only at the very end, where the longer list has one additional intermediate break. We decide to use the Jenks breaks with 6 classes, because we want a finer distinction between countries with unemployment between 11.3% and 21.0%: the last bin of the 5-class split spans almost 10 percentage points. This way, the outliers get a color of their own.
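As a small illustration of how the breaks partition the data, the sketch below assigns a rate to one of the 6 classes. It assumes that each break is the inclusive upper edge of its class; the way folium applies `threshold_scale` to values lying exactly on an edge may differ. The helper `class_index` is hypothetical.

```python
from bisect import bisect_left

# The 6-class breaks computed above for July 2017.
breaks = [2.8, 4.3, 6.5, 9.0, 11.3, 16.9, 21.0]

def class_index(rate, breaks):
    """Return the 0-based class of a rate, treating each break as an inclusive upper edge."""
    inner_edges = breaks[1:-1]  # the minimum and maximum do not separate classes
    return bisect_left(inner_edges, rate)

for rate in (3.0, 4.3, 5.4, 10.7, 21.0):
    print(rate, class_index(rate, breaks))
```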

In [15]:
bins_edges = jenks_breaks_6

Since we decided to use 6 classes, we need 6 colors to represent them. For that, we use the <a href="http://colorbrewer2.org/#type=sequential&scheme=YlOrBr&n=6">colorbrewer</a> website and choose a yellow-brown theme. We also have to mark the European countries which are not in the European Union, and the EU countries that have missing data, with a different color that suggests that the data is missing.

To distinguish easily between missing and existing data, we split the given topojson into two different topojsons: one with only the countries that appear in the DataFrame AND contain data in a specific column, and the other with only the countries that do not appear in the DataFrame OR have missing data in that column. The functions that perform the split are written below:

In [16]:
def country_in_dataframe_and_not_missing_value(json_entry, month):
    '''
    Function that returns True iff the country defined by the json entry appears in the unemployed DataFrame and has no missing
    value for the specified month.
    '''
    country_codes = set(unemployed.index)
    
    if json_entry["id"] not in country_codes:
        return False
    
    return not pd.isnull(unemployed.loc[json_entry["id"], month])

In [17]:
def construct_topojson_filter(filter_function):
    '''
    Function that computes a topojson from the initial one, only with the countries that pass the filtering function
    '''
    
    topojson_filtered = json.load(open(TOPOGRAPHY_FILE))
    
    filtered_entries_list = filter(filter_function, topojson["objects"]["europe"]["geometries"])
    topojson_filtered["objects"]["europe"]["geometries"] = list(filtered_entries_list)
    
    return topojson_filtered
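The effect of the split can be checked on a toy example. The sketch below mimics only the `objects.europe.geometries` part of the topojson and uses a plain dictionary of invented rates in place of the DataFrame; `split_topology` and `has_value` are hypothetical helpers.

```python
import copy
import math

# Hypothetical miniature topology: only the part of the structure used for filtering.
toy_topology = {"objects": {"europe": {"geometries": [{"id": "AT"}, {"id": "CH"}, {"id": "GR"}]}}}
# Hypothetical rates: CH is absent and GR is missing for the chosen month.
toy_rates = {"AT": 5.4, "GR": float("nan")}

def has_value(entry):
    """True if the country is present and its rate is not NaN."""
    rate = toy_rates.get(entry["id"])
    return rate is not None and not math.isnan(rate)

def split_topology(topology, predicate):
    """Return two copies of the topology: entries passing the predicate, and the rest."""
    geometries = topology["objects"]["europe"]["geometries"]
    with_data = copy.deepcopy(topology)
    without_data = copy.deepcopy(topology)
    with_data["objects"]["europe"]["geometries"] = [g for g in geometries if predicate(g)]
    without_data["objects"]["europe"]["geometries"] = [g for g in geometries if not predicate(g)]
    return with_data, without_data

with_data, without_data = split_topology(toy_topology, has_value)
ids_with = [g["id"] for g in with_data["objects"]["europe"]["geometries"]]
ids_without = [g["id"] for g in without_data["objects"]["europe"]["geometries"]]
print(ids_with, ids_without)  # ['AT'] ['CH', 'GR']
```

Every geometry lands in exactly one of the two topologies, which is what allows the two layers to be drawn on top of each other without overlap.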

In [18]:
month_dictionary_english = {1: 'January ',
                           2: "February ", 
                           3: "March ",
                           4: "April ", 
                           5: "May ", 
                           6: "June ",
                           7: "July ",
                           8: "August ", 
                           9: "September ",
                           10: "October ", 
                           11: "November ",
                           12: "December "}

Next, we write a function that computes the Choropleth map for a specific month given as a parameter. We first split the topojson into two topojsons, using the functions described above. Then, we group the data into 6 classes using Jenks Natural Breaks and compute the Choropleth map using this data and a color scheme from the *colorbrewer* website. Finally, for all the countries which have missing data or are not in the DataFrame, we draw a new color on top of the computed map, in strong contrast with all the colors used for unemployment rates, so that the user can easily tell missing data from available data. The function is written below:

In [19]:
def create_choropleth_for_month(month, m=None):
    if month > 9:
        year = '2016'
    else:
        year = '2017'        
    
    bins_edges = jenkspy.jenks_breaks(list(unemployed[month].dropna()), nb_class=6)[:-1] # create the bins, without the rightmost edge
    
    if m is None:
        m = folium.Map(location=[55, 10], tiles='Mapbox Bright', zoom_start=3.4)
    topojson_common = construct_topojson_filter(lambda x: country_in_dataframe_and_not_missing_value(x, month))
    topojson_distinct = construct_topojson_filter(lambda x: not country_in_dataframe_and_not_missing_value(x, month))

    # Construct Choropleth map with the countries in the DataFrame that don't have missing data for the month
    m.choropleth(
    geo_data=topojson_common,
    data=unemployed,
    columns=['country_code', month],
    key_on='feature.id',
    fill_color='YlOrBr',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Unemployment Rate for ' + month_dictionary_english[month] + year + ' (%)',
    topojson='objects.europe',
    threshold_scale=bins_edges,
    highlight=True)
    
    # for all the European countries that are not in the DataFrame OR have missing data for the specified month, fill them with purple
    folium.TopoJson(
    data=topojson_distinct,
    object_path='objects.europe',
    style_function=lambda feature: {
        'fillColor': '#7a0177',
        'fillOpacity' : 0.8,
        'color' : 'black',
        'weight' : 0.2}
    ).add_to(m)
    
    return m

Then, let's take a look at the actual unemployment DataFrame, to decide which month will be taken into consideration:

In [20]:
unemployed

Unnamed: 0,9,8,7,6,5,4,3,2,1,12,11,10,country_code
AT,5.6,5.5,5.4,5.3,5.4,5.6,5.7,5.8,5.7,5.7,5.8,5.9,AT
BE,7.1,7.3,7.3,7.2,7.3,7.4,7.6,7.7,7.6,7.2,7.2,7.2,BE
BG,6.1,6.2,6.1,6.1,6.1,6.2,6.4,6.6,6.6,6.7,6.8,7.0,BG
CY,10.3,10.6,10.7,10.9,11.3,11.6,12.1,12.5,12.7,13.2,13.4,13.1,CY
CZ,2.7,2.8,2.8,2.9,3.0,3.3,3.2,3.3,3.3,3.5,3.7,3.7,CZ
DE,3.6,3.6,3.7,3.7,3.8,3.8,3.9,3.9,3.9,3.9,3.9,4.0,DE
DK,5.7,5.7,5.8,5.7,5.7,5.7,5.9,6.1,6.1,6.1,6.3,6.5,DK
EA,8.9,9.0,9.0,9.1,9.2,9.2,9.4,9.5,9.6,9.6,9.7,9.8,EA
EA18,8.9,9.0,9.0,9.1,9.2,9.2,9.4,9.5,9.6,9.7,9.8,9.8,EA18
EA19,8.9,9.0,9.0,9.1,9.2,9.2,9.4,9.5,9.6,9.6,9.7,9.8,EA19


We can see that for September 2017 and August 2017 values are missing for some countries, most importantly for Greece, which is the country with the highest unemployment rate in recent months. Therefore, we show the unemployment map for July 2017, the most recent month for which we have data for all the European countries in the dataset:

In [21]:
EUROPE_UNEMPLOYMENT_RATE = os.path.join('.', 'out', 'europe_unemployment_rate.html')

m = create_choropleth_for_month(7)
m.save(EUROPE_UNEMPLOYMENT_RATE)

IFrame(src=EUROPE_UNEMPLOYMENT_RATE,width=900, height=600)

In order to make the map more interactive, we can enable popups when we click inside a country, which will display the actual unemployment rate for the specific country in the analyzed month. The code that enables the feature is written below:

In [22]:
def construct_dict_country_code_to_country_name():
    result = {}
    topojson = json.load(open(TOPOGRAPHY_FILE))
    for item in topojson["objects"]["europe"]["geometries"]:
        result[item["id"]] = item["properties"]["NAME"]
    
    return result

In [23]:
country_code_to_country_name = construct_dict_country_code_to_country_name()

In [24]:
def construct_popup_text(month, country_code, rate):
    '''
    Function that returns the text to be written in a specific country area.
    '''
    country = country_code_to_country_name[country_code]
    
    if month > 9:
        year = '2016'
    else:
        year = '2017'
        
    html = "<h3>" + country + " unemployment rate in " + month_dictionary_english[month] + year + " is: " # month names already end with a space
    html += str(rate) + "%.</h3><br>"
    
    return html

To enable the popups mentioned above, we need a small trick: for each country, we create a topojson containing only that country (basically, deleting all other countries from the topojson), attach a popup to it, and add it to the previously constructed map. All the functions are implemented below:

In [25]:
def construct_countries_topojsons():
    result = []
    
    topojson = json.load(open(TOPOGRAPHY_FILE))
    
    for item in topojson["objects"]["europe"]["geometries"]:
        copy_topojson = copy.deepcopy(topojson)
        copy_topojson["objects"]["europe"]["geometries"] = [item]
        
        result.append((item["id"], copy_topojson)) # pair (country, country_topojson)
        
    return result

In [26]:
def construct_popups_topojsons(month):
    '''
    Function that creates all the topojsons with popups for a specific month.
    '''
    
    eu_countries = set(unemployed.index)
    result = []
    
    countries_topojsons = construct_countries_topojsons()
    for country_code, country_topojson_dict in countries_topojsons:
        if (not (country_code in eu_countries)) or np.isnan(unemployed.loc[country_code, month]):
            rate = 'unknown'
        else :
            rate = unemployed.loc[country_code, month]
        
        popup_html = construct_popup_text(month, country_code, rate)
        popup = folium.Popup(html=popup_html, max_width=500)
        
        country_topojson = folium.TopoJson(
            country_topojson_dict, 
            'objects.europe',
            name=country_code,
            style_function=lambda feature:{'fillOpacity':0.01, 'opacity':0.01, 'weight':0.01, 'color':'white'})
        
        country_topojson.add_child(popup)
        
        result.append(country_topojson)
        
    return result

In [27]:
def create_choropleth_for_month_interactive(month, m=None):
    if m is None:
        m = folium.Map(location=[55, 10], tiles='Mapbox Bright', zoom_start=3.4)
    
    m = create_choropleth_for_month(month, m)
    
    countries_topojsons_popups = construct_popups_topojsons(month)
    for country_topojson in countries_topojsons_popups:
        country_topojson.add_to(m)
        
    return m

In [28]:
EUROPE_UNEMPLOYMENT_RATE_INTERACTIVE = os.path.join('.', 'out', 'europe_unemployment_rate_interactive.html')

# m = create_choropleth_for_month_interactive(7)
# m.save(EUROPE_UNEMPLOYMENT_RATE_INTERACTIVE)

IFrame(src=EUROPE_UNEMPLOYMENT_RATE_INTERACTIVE,width=900, height=600)

To compare the European countries with Switzerland, we add the value for Switzerland for July 2017 to the *unemployed* DataFrame and then compute the map again:

In [29]:
data_switzerland = pd.read_excel('2_1 Taux de chômage.xlsx')
unemployment_rate_switzerland = pd.to_numeric(data_switzerland[data_switzerland.Canton == 'Total']['Juillet 2017'].values[0])

row = {month:unemployment_rate_switzerland if month==7 else np.NaN for month in range(1, 13)}
row['country_code'] = 'CH'

row = pd.Series(row, name='CH')
unemployed = unemployed.append(row)



Below, we look at the countries with the lowest unemployment rates. Ignoring Japan (*JP*), which is part of the dataset but not a European country, the first two countries are the Czech Republic and Iceland, followed closely by Switzerland with a 3% unemployment rate. Compared to the European countries, Switzerland is therefore among the top 3 in terms of low unemployment.

In [30]:
unemployed[7].sort_values().head()

CZ    2.8
JP    2.8
IS    2.9
CH    3.0
DE    3.7
Name: 7, dtype: float64
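The ranking above still contains codes that are not European countries, such as *JP*. A sketch of a filtered ranking is shown below, using the July 2017 values printed in this notebook and the list of non-country codes found when comparing the DataFrame with the topojson.

```python
import pandas as pd

# Codes present in the dataset that are aggregates or non-European countries.
EXCLUDED_CODES = {"EA", "EA18", "EA19", "EU25", "EU27", "EU28", "JP", "US"}

# July 2017 values copied from the outputs shown in this notebook.
july = pd.Series({"CZ": 2.8, "JP": 2.8, "IS": 2.9, "CH": 3.0, "DE": 3.7, "EA": 9.0})

# Drop the excluded codes, then take the three lowest rates.
european = july[~july.index.isin(EXCLUDED_CODES)]
lowest = european.nsmallest(3)
print(lowest.index.tolist())  # ['CZ', 'IS', 'CH']
```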

Lastly, we present the map for the European Union and Switzerland to conclude the exercise, since a picture is worth a thousand words:

In [31]:
EUROPE_CH_UNEMPLOYMENT_RATE_INTERACTIVE = os.path.join('.', 'out', 'europe_CH_unemployment_rate_interactive.html')

m = create_choropleth_for_month_interactive(7)
m.save(EUROPE_CH_UNEMPLOYMENT_RATE_INTERACTIVE)

IFrame(src=EUROPE_CH_UNEMPLOYMENT_RATE_INTERACTIVE,width=900, height=600)

## 2. Swiss Unemployment Rate

Go to the [amstat](https://www.amstat.ch) website to find a dataset that includes the unemployment rates in Switzerland at a recent date.

   > *HINT* Go to the `details` tab to find the raw data you need. If you do not speak French, German or Italian, think of using free translation services to navigate your way through. 

   Use this data to build another Choropleth map, this time showing the unemployment rate at the level of swiss cantons. Again, try to make the map as expressive as possible, and comment on the trends you observe.

   The Swiss Confederation defines the rates you have just plotted as the number of people looking for a job divided by the size of the active population (scaled by 100). This is surely a valid choice, but as we discussed one could argue for a different categorization.

   Copy the map you have just created, but this time don't count in your statistics people who already have a job and are looking for a new one. How do your observations change ? You can repeat this with different choices of categories to see how selecting different metrics can lead to different interpretations of the same data.


### Used libraries

In [32]:
import os
import pandas as pd
import folium
import numpy as np
import jenkspy

from IPython.display import display

### Understanding the data

In order to solve our task we have to build two Choropleth maps presenting two different interpretations of the unemployment rate:
* First map should express the rate of people registered as jobseekers for each canton

* Second map should exclude from the statistics the people that are already employed

Amstat provides the unemployment rate, defined as the number of **chômeurs inscrits** divided by the number of **personnes actives**, as can be seen in the image below, captured from the [definitions section of the amstat site](https://www.amstat.ch/v2/definition.jsp?lang=fr) (hover over the image for an English translation, courtesy of Google Translate).

![](images/taux_de_chomage.png "Number of registered unemployed at the reference day (last day of the month) divided by the number of active persons, multiplied by 100. The number of active persons is recorded each year by the Federal Statistical Office as part of the Structural Survey ( census of the population). Since January 1, 2014, it has risen to 4'493'249 according to the three-year pooling of data collected in the framework of the 2012-2014 structural surveys. Active persons used by SECO also includes diplomats and international civil servants domiciled in Switzerland.")

Looking further into the definition of **chômeurs inscrits**, we find that it denotes the jobseekers who are unemployed. Again, hover over the image for an English translation, courtesy of Google Translate.

![](images/chomeurs_inscrits.png "Persons registered at regional employment agencies, who are unemployed and immediately available for placement. It does not matter whether they are receiving unemployment benefits or not.")


According to the amstat definitions, the **personnes actives** (active population) consist of both employed and unemployed individuals. Here we encounter a first limitation of the dataset: we have no further details about the active population, for example its age range.

![](images/personnes_actives.png "Employed persons (at least one hour per week) or unemployed.The unemployment rate is calculated by taking the number of active persons as the denominator. Breakdown by regions, cantons, nationalities, age groups and by sex, the number of active persons influences various tables of SECO's labor market statistics. Exception: For the economic branches, the unemployment rate is not calculated on the basis of the number of active persons, but of the number of active persons employed.
Since 2010, the Federal Statistical Office (FSO) has been counting the number of active persons per year in the framework of the Structural Survey on the active life of the population.
The completion of an annual structural survey makes it possible to cumulate the results over a period of several years (pooling). The advantage of this way of proceeding is to have wider data
in the field of active persons. Since 1 January 2014, SECO has no longer calculated the unemployment rate on the basis of the number of active persons dating from 2010, but on the
of their number determined in the context of pooling over the period 2012 to 2014 based on data from the structural survey.
Using the sampling method allows a more regular adjustment of the denominator of the unemployment rate than the method previously used based on the population census
(exhaustive survey carried out every ten years). The number of active persons on which SECO is based also includes diplomats and international civil servants residing in Switzerland.
(Before 31 December 1999, the unemployment rate was calculated on the basis of the number of persons engaged in gainful employment of at least six hours per week, which is no longer available).")

We can therefore note that the published unemployment rate is exactly the data needed for the second map: it takes into account only the jobseekers who currently do not have a job.

For the first map, amstat does not provide the necessary rates, so we compute them from the published unemployment rate and the number of jobseekers, both of which we have access to.

Again from the definitions section, we see that **Demandeurs d'emploi** is the number of registered jobseekers, regardless of whether they are employed or not. This is why, in the next section, we select this indicator when exporting the data.

![](images/demandeus_d_emploi_inscrits.png "All job seekers, unemployed and non-unemployed, who are registered with regional employment agencies and are looking for work.") 

# Getting the data

As indicated in the statement, we head over to [amstat (fr)](https://www.amstat.ch) to get the data. 

Under the Details section we are presented with multiple options, of which **Chômeurs et demandeurs d'emploi** (unemployed and jobseekers) is the one of interest here. Within it, the section **Taux de chômage** (unemployment rate) gives us the data we want. Opening it, we are presented with multiple indicators for the export. The following are of interest to us:

> **Taux de chômage** (unemployment rate), computed as **Chômeurs inscrits** **/** **personnes actives** scaled by 100. This is the key statistic for the second map
>
> **Demandeurs d'emploi** (jobseekers), necessary for computing the rate for the first map
>
> **Chômeurs inscrits** (unemployed jobseekers), necessary for computing the rate for the first map as described below

We export the unemployment data only for **September 2017**, as we want to focus on the current situation in Switzerland. The limitation of this decision is that we have no yearly context in which to spot, for example, seasonal trends, so any statement we make applies to September only and not to the unemployment rate in general.

The export offers multiple indicators to choose from; for the purposes of our task, this is the selection we exported:

![](images/indicators_1.png)

<a id='computing_data_first_map'></a>
## Computing data for the first map

By definition, 

>taux de chômage = chômeurs inscrits / personnes actives * 100

For the first map, we want the following new rate

>new taux de chômage = demandeurs d'emploi / personnes actives * 100

which can be obtained from the published rate with the following formula (substituting personnes actives = chômeurs inscrits * 100 / taux de chômage):

>new taux de chômage = (demandeurs d'emploi)\*(taux de chômage) / (chômeurs inscrits)
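
As a quick sanity check of this conversion, the sketch below applies it to the Zurich figures from the exported dataset (rate 3.3, 27225 registered unemployed, 34156 jobseekers). The variable names are illustrative only, and the implied active population is approximate because the published rate is rounded to one decimal.

```python
# Figures for Zurich, September 2017, as exported from amstat
taux_de_chomage = 3.3        # published unemployment rate, in %
chomeurs_inscrits = 27225    # registered unemployed
demandeurs_emploi = 34156    # all registered jobseekers

# Active population implied by the published (rounded) rate
personnes_actives = chomeurs_inscrits / taux_de_chomage * 100

# Jobseekers rate, computed directly and with the shortcut formula
rate_direct = demandeurs_emploi / personnes_actives * 100
rate_formula = demandeurs_emploi * taux_de_chomage / chomeurs_inscrits

print(round(rate_direct, 2), round(rate_formula, 2))
```

Both expressions give the same value, so the active population never has to be exported separately.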

# Working with the data


There are multiple things we can notice when glancing at the data:
* the first row contains additional information about some of the columns, so we will want to incorporate this information into the column names

* the column Mois seems to hold only NaN values, so we will want to drop it as it carries no information 

In [33]:
data = pd.read_excel('2_1 Taux de chômage Septembre.xlsx')

In [34]:
data.head()

Unnamed: 0,Canton,Mois,Septembre 2017,Septembre 2017.1,Septembre 2017.2,Septembre 2017.3,Septembre 2017.4,Total,Total.1,Total.2,Total.3,Total.4
0,,Mesures,Taux de chômage,Coefficients de variation,Chômeurs inscrits,Demandeurs d'emploi,Demandeurs d'emploi non chômeurs,Taux de chômage,Coefficients de variation,Chômeurs inscrits,Demandeurs d'emploi,Demandeurs d'emploi non chômeurs
1,Zurich,,3.3,A,27225,34156,6931,3.3,A,27225,34156,6931
2,Berne,,2.4,A,13658,18385,4727,2.4,A,13658,18385,4727
3,Lucerne,,1.7,A,3885,6756,2871,1.7,A,3885,6756,2871
4,Uri,,0.6,C,112,257,145,0.6,C,112,257,145


## Inspecting the data

As presumed, apart from the first row, which does not contain actual unemployment data, all the values in the column *Mois* are NaN, so we drop it.

In [35]:
data[1:]['Mois'].any()

False

In [36]:
data = data.drop('Mois', axis=1)
data.head()

Unnamed: 0,Canton,Septembre 2017,Septembre 2017.1,Septembre 2017.2,Septembre 2017.3,Septembre 2017.4,Total,Total.1,Total.2,Total.3,Total.4
0,,Taux de chômage,Coefficients de variation,Chômeurs inscrits,Demandeurs d'emploi,Demandeurs d'emploi non chômeurs,Taux de chômage,Coefficients de variation,Chômeurs inscrits,Demandeurs d'emploi,Demandeurs d'emploi non chômeurs
1,Zurich,3.3,A,27225,34156,6931,3.3,A,27225,34156,6931
2,Berne,2.4,A,13658,18385,4727,2.4,A,13658,18385,4727
3,Lucerne,1.7,A,3885,6756,2871,1.7,A,3885,6756,2871
4,Uri,0.6,C,112,257,145,0.6,C,112,257,145


## Creating meaningful columns 

Since the first row holds descriptive information (as exported by the amstat site), we transform the dataframe to use hierarchical columns, where the first level is the month and the second level is the additional information from the first row.

For this purpose, we build two parallel arrays, one with the month of each column and one with the corresponding information, and use them to create a MultiIndex which is set as the columns.

We use a regular expression to extract the month and the year from the column names, or the word *Total* for the last columns.
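
To make the effect of the regular expression concrete, here is a minimal sketch on four column names of the kind pandas produces for this export. The shortened lists `raw_columns` and `measures` are illustrative stand-ins for the full table, not part of the notebook.

```python
import pandas as pd

# Column names as pandas reads them: repeated header cells get a ".N" suffix
raw_columns = pd.Index(['Septembre 2017', 'Septembre 2017.1', 'Total', 'Total.1'])
# Second header level, taken from the first data row of the export
measures = ['Taux de chômage', 'Coefficients de variation',
            'Taux de chômage', 'Coefficients de variation']

# Keep "<month> <year>" or the literal "Total", dropping the ".N" suffix
periods = raw_columns.str.extract(r'(.* \d+|Total)', expand=False)

# Combine both levels into hierarchical columns
columns = pd.MultiIndex.from_arrays([periods, measures])
print(columns.tolist())
```

Each resulting column is addressed by a `(period, measure)` tuple, which is the form used in the rest of the notebook.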

**Note**:
A problematic case is the Canton column, whose first row has no additional information. We solve this by setting this column as the index, processing the dataframe, and afterwards resetting the index so that Canton is again a column, as needed for the Choropleth map.

In [37]:
canton_indexed_data = data.set_index('Canton')
canton_indexed_data.head()

Unnamed: 0_level_0,Septembre 2017,Septembre 2017.1,Septembre 2017.2,Septembre 2017.3,Septembre 2017.4,Total,Total.1,Total.2,Total.3,Total.4
Canton,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
,Taux de chômage,Coefficients de variation,Chômeurs inscrits,Demandeurs d'emploi,Demandeurs d'emploi non chômeurs,Taux de chômage,Coefficients de variation,Chômeurs inscrits,Demandeurs d'emploi,Demandeurs d'emploi non chômeurs
Zurich,3.3,A,27225,34156,6931,3.3,A,27225,34156,6931
Berne,2.4,A,13658,18385,4727,2.4,A,13658,18385,4727
Lucerne,1.7,A,3885,6756,2871,1.7,A,3885,6756,2871
Uri,0.6,C,112,257,145,0.6,C,112,257,145


In [38]:
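# The first row holds the measure names, which become the second column level
columns_info = canton_indexed_data.iloc[0].values
canton_indexed_data.drop(canton_indexed_data.index[0], inplace=True)

# Keep "<month> <year>" or "Total" from each column name, dropping the ".N" suffix added by pandas
columns_months = canton_indexed_data.columns.str.extract(r'(.* \d+|Total)', expand=True).values.reshape(columns_info.shape)

canton_indexed_data.columns = pd.MultiIndex.from_arrays((columns_months.tolist(), columns_info.tolist()))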

display(canton_indexed_data.head())
display(canton_indexed_data.tail())

Unnamed: 0_level_0,Septembre 2017,Septembre 2017,Septembre 2017,Septembre 2017,Septembre 2017,Total,Total,Total,Total,Total
Unnamed: 0_level_1,Taux de chômage,Coefficients de variation,Chômeurs inscrits,Demandeurs d'emploi,Demandeurs d'emploi non chômeurs,Taux de chômage,Coefficients de variation,Chômeurs inscrits,Demandeurs d'emploi,Demandeurs d'emploi non chômeurs
Canton,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Zurich,3.3,A,27225,34156,6931,3.3,A,27225,34156,6931
Berne,2.4,A,13658,18385,4727,2.4,A,13658,18385,4727
Lucerne,1.7,A,3885,6756,2871,1.7,A,3885,6756,2871
Uri,0.6,C,112,257,145,0.6,C,112,257,145
Schwyz,1.7,A,1455,2229,774,1.7,A,1455,2229,774


Unnamed: 0_level_0,Septembre 2017,Septembre 2017,Septembre 2017,Septembre 2017,Septembre 2017,Total,Total,Total,Total,Total
Unnamed: 0_level_1,Taux de chômage,Coefficients de variation,Chômeurs inscrits,Demandeurs d'emploi,Demandeurs d'emploi non chômeurs,Taux de chômage,Coefficients de variation,Chômeurs inscrits,Demandeurs d'emploi,Demandeurs d'emploi non chômeurs
Canton,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Valais,2.8,A,4816,8027,3211,2.8,A,4816,8027,3211
Neuchâtel,5.1,A,4738,6350,1612,5.1,A,4738,6350,1612
Genève,5.2,A,12234,15497,3263,5.2,A,12234,15497,3263
Jura,4.4,B,1619,2375,756,4.4,B,1619,2375,756
Total,3.0,A,133169,193624,60455,3.0,A,133169,193624,60455


The DataFrame columns are now properly formatted, so we reset the index to have the cantons as a column again.

Additionally, we drop the last row of the dataframe: it aggregates the information over all cantons, which we do not use in our choropleth map.

In [39]:
cantons_data = canton_indexed_data.reset_index()[:-1]
cantons_data.head()

Unnamed: 0_level_0,Canton,Septembre 2017,Septembre 2017,Septembre 2017,Septembre 2017,Septembre 2017,Total,Total,Total,Total,Total
Unnamed: 0_level_1,Unnamed: 1_level_1,Taux de chômage,Coefficients de variation,Chômeurs inscrits,Demandeurs d'emploi,Demandeurs d'emploi non chômeurs,Taux de chômage,Coefficients de variation,Chômeurs inscrits,Demandeurs d'emploi,Demandeurs d'emploi non chômeurs
0,Zurich,3.3,A,27225,34156,6931,3.3,A,27225,34156,6931
1,Berne,2.4,A,13658,18385,4727,2.4,A,13658,18385,4727
2,Lucerne,1.7,A,3885,6756,2871,1.7,A,3885,6756,2871
3,Uri,0.6,C,112,257,145,0.6,C,112,257,145
4,Schwyz,1.7,A,1455,2229,774,1.7,A,1455,2229,774


## Data values types

By calling the info method on the loaded data, we notice that where we would expect the values to be numbers, they are actually objects. This means we have to parse them to numbers in order to use them with the Choropleth map. 

In [40]:
cantons_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 11 columns):
(Canton, )                                            26 non-null object
(Septembre 2017, Taux de chômage)                     26 non-null object
(Septembre 2017, Coefficients de variation)           26 non-null object
(Septembre 2017, Chômeurs inscrits)                   26 non-null object
(Septembre 2017, Demandeurs d'emploi)                 26 non-null object
(Septembre 2017, Demandeurs d'emploi non chômeurs)    26 non-null object
(Total, Taux de chômage)                              26 non-null object
(Total, Coefficients de variation)                    26 non-null object
(Total, Chômeurs inscrits)                            26 non-null object
(Total, Demandeurs d'emploi)                          26 non-null object
(Total, Demandeurs d'emploi non chômeurs)             26 non-null object
dtypes: object(11)
memory usage: 2.3+ KB


We convert only the values for September 2017, as they are the ones we are working with.

In [41]:
cantons_data[('Septembre 2017', 'Taux de chômage')] = cantons_data[('Septembre 2017', 'Taux de chômage')].astype(float)
cantons_data[('Septembre 2017', 'Demandeurs d\'emploi')] = cantons_data[('Septembre 2017', 'Demandeurs d\'emploi')].astype(float)
cantons_data[('Septembre 2017', 'Chômeurs inscrits')] = cantons_data[('Septembre 2017', 'Chômeurs inscrits')].astype(float)
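
A more defensive variant of this conversion, shown here only as a sketch, is `pd.to_numeric` with `errors='coerce'`: entries that cannot be parsed become NaN instead of raising. The miniature `sample` frame and its `'n/a'` entry are hypothetical and serve only to illustrate the behaviour; the real export contains no such value in the rows shown above.

```python
import pandas as pd

# Hypothetical miniature of the exported table: numbers stored as objects
sample = pd.DataFrame({
    'Taux de chômage': ['3.3', '2.4', 'n/a'],
    'Chômeurs inscrits': ['27225', '13658', '3885'],
}, dtype=object)

numeric_columns = ['Taux de chômage', 'Chômeurs inscrits']
for name in numeric_columns:
    # errors='coerce' turns unparseable entries into NaN instead of raising
    sample[name] = pd.to_numeric(sample[name], errors='coerce')

print(sample.dtypes)
print(sample.isna().sum())
```

Counting the NaN values afterwards shows whether any cell failed to parse, which `astype(float)` would instead report as an exception.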

## Creating the Choropleth map

To create the Choropleth map we will use the choropleth method of the folium.Map class. One thing to notice here is that in the method's definition the geo_data parameter expects a GeoJSON geometry definition. Inspecting the documentation, we can see that the choropleth method also accepts TopoJSON data, in this way:

>TopoJSONs can be passed as "geo_data", but the "topojson" keyword must also be passed with the reference to the topojson objects to convert.

Therefore, we create the json object from the topojson file in order to pass it as the geo_data parameter in the choropleth method.

In [42]:
import json

topo_path = os.path.join('topojson', 'ch-cantons.topojson.json')

with open(topo_path, encoding='utf-8') as json_file:  # explicit encoding so names like Zürich are read correctly
    topo_data = json.load(json_file)

In order to bind data in the Choropleth map we must make sure that the canton names in the dataset match exactly the canton names in the TopoJson.

With this in mind, we create a dataframe to compare side-by-side the canton names from both sources in the following way:

* We extract the names from the TopoJson which are identified in the Json tree with the following path: objects -> cantons -> geometries -> properties -> name 

* From the amstat dataset, we extract the column containing the cantons' names

In [43]:
topo_cantons_names = []

for geometry in topo_data['objects']['cantons']['geometries']:
    topo_cantons_names.append(geometry['properties']['name'])

We display the entire dataframe below in order to make sure that the values in the two columns are properly aligned, since we will replace one list of names with the other.

In [44]:
pd.DataFrame(topo_cantons_names, cantons_data['Canton'])

Unnamed: 0_level_0,0
Canton,Unnamed: 1_level_1
Zurich,Zürich
Berne,Bern/Berne
Lucerne,Luzern
Uri,Uri
Schwyz,Schwyz
Obwald,Obwalden
Nidwald,Nidwalden
Glaris,Glarus
Zoug,Zug
Fribourg,Fribourg


With a little bit of help from Wikipedia, we see that the values in each row refer to the same canton, but in different languages. Knowing this, we can replace the values in the *Canton* column of the amstat dataset with the ones extracted from the TopoJson.

We do this so that we have no matching problems when we bind the data to the Choropleth map.

In [45]:
cantons_data['Canton'] = topo_cantons_names
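
The positional assignment above relies on both sources listing the cantons in the same order. An alternative that does not depend on ordering is an explicit name mapping, sketched below. The dictionary `french_to_topo` is an illustration that covers only the ten cantons shown in the comparison (with Zürich in its proper UTF-8 spelling), and it would need to be completed for all 26.

```python
import pandas as pd

# French names from the amstat export mapped to the names used in the TopoJSON.
# Illustrative and incomplete: only the cantons displayed in the comparison.
french_to_topo = {
    'Zurich': 'Zürich',
    'Berne': 'Bern/Berne',
    'Lucerne': 'Luzern',
    'Uri': 'Uri',
    'Schwyz': 'Schwyz',
    'Obwald': 'Obwalden',
    'Nidwald': 'Nidwalden',
    'Glaris': 'Glarus',
    'Zoug': 'Zug',
    'Fribourg': 'Fribourg',
}

# Deliberately shuffled, with one canton that has no mapping yet
amstat_names = pd.Series(['Lucerne', 'Zurich', 'Genève'])
mapped = amstat_names.map(french_to_topo)

# Names without a mapping become NaN, which makes gaps easy to detect
missing = amstat_names[mapped.isna()].tolist()
print(mapped.tolist())
print(missing)
```

With a complete mapping, `missing` would be empty, and a reordered export could no longer attach rates to the wrong canton silently.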

In [46]:
cantons_data.head()

Unnamed: 0_level_0,Canton,Septembre 2017,Septembre 2017,Septembre 2017,Septembre 2017,Septembre 2017,Total,Total,Total,Total,Total
Unnamed: 0_level_1,Unnamed: 1_level_1,Taux de chômage,Coefficients de variation,Chômeurs inscrits,Demandeurs d'emploi,Demandeurs d'emploi non chômeurs,Taux de chômage,Coefficients de variation,Chômeurs inscrits,Demandeurs d'emploi,Demandeurs d'emploi non chômeurs
0,Zürich,3.3,A,27225.0,34156.0,6931,3.3,A,27225,34156,6931
1,Bern/Berne,2.4,A,13658.0,18385.0,4727,2.4,A,13658,18385,4727
2,Luzern,1.7,A,3885.0,6756.0,2871,1.7,A,3885,6756,2871
3,Uri,0.6,C,112.0,257.0,145,0.6,C,112,257,145
4,Schwyz,1.7,A,1455.0,2229.0,774,1.7,A,1455,2229,774


## Infer the data for the first map

As explained in the section [Computing data for the first map](#computing_data_first_map), we will now create a new column with a rate that counts all jobseekers.

In [47]:
cantons_data[('Septembre 2017', 'Taux demandeurs d emploi')] = \
    cantons_data[('Septembre 2017', 'Demandeurs d\'emploi')] * \
    cantons_data[('Septembre 2017', 'Taux de chômage')] /\
    cantons_data[('Septembre 2017', 'Chômeurs inscrits')]

# Exploring visualization differences

We can argue that the unemployment rate can be expressed as the jobseekers divided by the active population. 

This might seem reasonable because the employed people that register as jobseekers might not be content with their current place of work and could quit or might expect to be fired sooner or later.

We now plot this rate in a choropleth map where the classes are chosen to be fixed in order to compare with the next visualizations.

In [48]:
swiss_coord = [46.8827, 8.2178]
legend_scale=[0,2,4.1,6,10]

In [49]:
def add_choropleth_layer(m, column, fill_color, layer_name,
                         legend_name, scale='quantile', nb_classes=3,
                         fixed_thresholds=None):
    '''
    Utility function that makes it easier to control the threshold_scale of a choropleth layer.
    Parameters:
    m: folium map to which the layer is added
    column: the column of cantons_data that is bound to the map
    fill_color: passed to the choropleth method
    layer_name: passed to the choropleth method as the name of the layer
    legend_name: passed to the choropleth method
    scale: specifies the scale of the choropleth map. Options:
            - quantile: the data is classified in nb_classes of the same approximate size
            - jenks: the data is classified in nb_classes using the Natural Breaks (Jenks) classification
            - fixed: the fixed_thresholds parameter is used to specify the scale
    nb_classes: number of classes for the quantile and jenks scales
    fixed_thresholds: list of class boundaries, required when scale is 'fixed'
    '''
    if scale == 'quantile':
        _, bins = pd.qcut(cantons_data[column], nb_classes, retbins=True)
        bins = list(bins)
    elif scale == 'jenks':
        bins = jenkspy.jenks_breaks(cantons_data[column], nb_classes)
    elif scale == 'fixed':
        if fixed_thresholds is None:
            raise ValueError('Missing fixed thresholds in add choropleth layer')
        bins = fixed_thresholds
    else:
        raise ValueError('Unknown scale: {}'.format(scale))
            
    m.choropleth( 
        geo_data=topo_data,
        topojson='objects.cantons',
        data=cantons_data,
        name=layer_name,
        columns=['Canton',column],
        key_on='feature.properties.name',
        fill_color=fill_color,
        fill_opacity=0.7,
        line_opacity=0.2,
        legend_name=legend_name,
        threshold_scale=bins
    )
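
To illustrate what the `quantile` and `fixed` options pass on as thresholds, the sketch below applies `pd.qcut` and `pd.cut` to the nine cantonal rates visible in the table excerpts above (Zurich, Berne, Lucerne, Uri, Schwyz, Valais, Neuchâtel, Genève, Jura). It works on this subset only, so the quantile boundaries will differ from those of the full 26-canton dataset.

```python
import pandas as pd

# Unemployment rates of the nine cantons visible in the excerpts (September 2017)
rates = pd.Series([3.3, 2.4, 1.7, 0.6, 1.7, 2.8, 5.1, 5.2, 4.4])

# Quantile scale: boundaries chosen so each class holds roughly the same number of cantons
_, quantile_bins = pd.qcut(rates, 3, retbins=True)
print(list(quantile_bins))

# Fixed scale: the same boundaries regardless of the data
fixed_thresholds = [0, 2, 4.1, 6, 10]
fixed_classes = pd.cut(rates, bins=fixed_thresholds, labels=False)
print(fixed_classes.tolist())
```

The quantile boundaries move with the data, whereas the fixed classes stay comparable across maps, which is why the fixed option is used when two rates are compared.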

We would like to show additional information about each canton on the map, such as its name, the actual value, or some confidence measure. To do this we use popups, and to add them to the map we implement two helper functions:
* *create_canton_topos*: returns one TopoJSON per canton, used to attach a click popup to each canton
* *add_popups*: builds the popup content and binds it to the canton and the map

In [50]:
import copy

def create_canton_topos(originalTopoJSON):
    canton = []
    for geometry in originalTopoJSON["objects"]["cantons"]["geometries"]:
        tmp_topo = copy.deepcopy(originalTopoJSON)
        tmp_topo["objects"]["cantons"]["geometries"]=[geometry]
        
        canton.append(tmp_topo)
              
    return canton

cantons_topo = create_canton_topos(topo_data)
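
The following sketch shows what this helper produces on a toy input. The `toy_topo` structure is a hypothetical, heavily simplified stand-in for the real `ch-cantons.topojson.json` file (its arcs are empty), and the function is repeated here with a snake_case parameter name so the block runs on its own.

```python
import copy

def create_canton_topos(original_topojson):
    """Return one TopoJSON per canton, each holding a single geometry."""
    cantons = []
    for geometry in original_topojson['objects']['cantons']['geometries']:
        # Deep copy so that the original structure is left untouched
        single = copy.deepcopy(original_topojson)
        single['objects']['cantons']['geometries'] = [geometry]
        cantons.append(single)
    return cantons

# Hypothetical miniature TopoJSON with two cantons and no real geometry
toy_topo = {
    'type': 'Topology',
    'arcs': [],
    'objects': {'cantons': {'type': 'GeometryCollection', 'geometries': [
        {'type': 'Polygon', 'arcs': [[0]], 'properties': {'name': 'Uri'}},
        {'type': 'Polygon', 'arcs': [[1]], 'properties': {'name': 'Schwyz'}},
    ]}},
}

per_canton = create_canton_topos(toy_topo)
names = [t['objects']['cantons']['geometries'][0]['properties']['name']
         for t in per_canton]
print(names)
```

Each copy carries the full set of shared arcs, which is simple but memory-hungry. That trade-off is acceptable for 26 cantons.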

In [51]:
import branca 

def add_range_coeff_var(coeff):
    '''
    Adds explicit values for the variation coefficients
    '''
    if coeff == 'A':
        return ' '.join([coeff, '(0 to 1%)'])
    if coeff == 'B':
        return ' '.join([coeff, '(1.1 to 2%)'])
    if coeff == 'C':
        return ' '.join([coeff, '(2.1 to 5%)'])
    if coeff == 'D':
        return ' '.join([coeff, '(5.1 to 10%)'])
    return coeff

def add_popups(m, rate_name, column):
    '''
    Adds popups to each canton for map m.
    rate_name specifies what is the unemployment rate
    column specifies the column whose values will be shown in the popup 
    '''
    for canton in cantons_topo:
        canton_name = canton["objects"]["cantons"]["geometries"][0]['properties']['name']
        coeff_column = ('Septembre 2017','Coefficients de variation')
        canton_data = cantons_data[cantons_data['Canton'] == canton_name]
    
        html = '''
            <h3>{}</h3>
            <p> {}: {} </p>
            <p> Coefficient of variation: {}</p>
            
        '''.format(canton_name, 
                   rate_name,
                   canton_data[column].values[0], 
                   add_range_coeff_var(canton_data[coeff_column].values[0]))
        
        iframe = branca.element.IFrame(html=html, width=300, height=150)
        popup = folium.Popup(iframe, max_width=2650)
        
        tj = folium.TopoJson(canton, 
                   'objects.cantons',
                   name=canton_name)
        
        tj.add_child(popup)
        tj.add_to(m)
        

We can now easily create the map corresponding to the unemployment rate computed taking into consideration jobseekers, employed or unemployed.

In [52]:
m_jobseekers = folium.Map(
    location= swiss_coord,
    tiles='Mapbox Bright',
    zoom_start=8
)

add_choropleth_layer(m=m_jobseekers, 
                     column=('Septembre 2017', 'Taux demandeurs d emploi'),
                     fill_color='YlOrRd',
                     layer_name='jobseekers',
                     legend_name='% of jobseekers out of active population in month september',
                     scale='fixed',
                     fixed_thresholds=legend_scale
                    )
column = ('Septembre 2017', 'Taux demandeurs d emploi') 
add_popups(m_jobseekers, 
           'Jobseekers rate for month September',
           column)

m_jobseekers.save('swiss_jobseekers_september_2017.html')

In [53]:
from IPython.display import IFrame
IFrame(src="swiss_jobseekers_september_2017.html",width=900,height=800)

An interpretation that is just as valid, but which could be considered a bit more rigorous and to the point, is to take into account only the people who are unemployed. This is more intuitive given that we are looking at the unemployment rate.

We proceed to plot this rate with the same classes as in the previous plot.

In [54]:
m_unemployed = folium.Map(
    location= swiss_coord,
    tiles='Mapbox Bright',
    zoom_start=8
)

add_choropleth_layer(m=m_unemployed,
                     column=('Septembre 2017', 'Taux de chômage'),
                     fill_color='YlOrRd',
                     layer_name='unemployed jobseeker',
                     legend_name='% of unemployed jobseekers out of active population in month september',
                     scale='fixed',
                     fixed_thresholds=legend_scale
                    )


column = ('Septembre 2017', 'Taux de chômage') 
add_popups(m_unemployed, 
           'Unemployed jobseekers rate for month September',
           column)

m_unemployed.save('swiss_unemployed_jobseekers_september_2017.html')

In [55]:
from IPython.display import IFrame
IFrame(src="swiss_unemployed_jobseekers_september_2017.html",width=900,height=800)

## Importance of scale

For a better visualization we want both unemployment rates plotted on the same map. 

One way to show the differences is to use the same scale, given that both maps present an unemployment rate. The colors in the choropleth map then correspond to the same classes in both visualizations, making it easy to see the effect of the way the unemployment rate is computed.

But there is a perception problem with a fixed scale for two different ways of computing the rate: as one of them counts more people (not only unemployed jobseekers) while the denominator stays the same, its values are shifted towards one side of the scale, giving the impression of multiple extreme values.

An important decision here is the choice of thresholds for the fixed scale, as it drastically changes the message sent by the map. We adopt the scale used in the amstat visualization, [0,2,4.1], which we extend with two more thresholds: [0,2,4.1,6,10]

To see this more clearly, we create two choropleth layers and add a layer control to the map

In [56]:
m = folium.Map(
    location= swiss_coord,
    tiles='Mapbox Bright',
    zoom_start=8
)

add_choropleth_layer(m=m,
                     column=('Septembre 2017', 'Taux de chômage'),
                     fill_color='YlOrRd',
                     layer_name='unemployed jobseeker',
                     legend_name='% of unemployed jobseekers out of active population in month september',
                     scale='fixed',
                     fixed_thresholds=legend_scale
                    )

add_choropleth_layer(m=m, 
                     column=('Septembre 2017', 'Taux demandeurs d emploi'),
                     fill_color='YlOrRd',
                     layer_name='all jobseekers',
                     legend_name='% of jobseekers out of active population in month september',
                     scale='fixed',
                     fixed_thresholds=legend_scale
                    )

folium.LayerControl().add_to(m)

m.save('swiss_2_choropleths_fixed_scale.html')

In [57]:
from IPython.display import IFrame
IFrame(src="swiss_2_choropleths_fixed_scale.html",width=900,height=800)

We can observe the difference between the two ways of computing the unemployment rate by toggling the choropleth layer corresponding to each of them. Clearly, the message sent by the two visualizations is different in this setup: the rate that also counts the employed jobseekers has larger values, so the cantons are mapped to more intense colors, provoking a stronger reaction in the viewer.

Another way of plotting the results is to keep the colors relative to each method of computing the unemployment rate. For this, we use the Natural Breaks (Jenks) classification, because it chooses the class boundaries so that the variation within each class is as small as possible. This aligns with the intuition that cantons colored the same are similar with regard to unemployment rate.

As we can see below, the visual impact now differs little between the two maps. The strong drawback is that the viewer has to check the legend to realise that the same colors correspond to different values, which is not something one would normally expect.
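
To make the idea behind natural breaks tangible, here is a small brute-force sketch. It is not the algorithm jenkspy implements, but an exhaustive search for the same objective: splitting the sorted values into classes that minimise the within-class sum of squared deviations. It is run on the nine cantonal rates visible in the table excerpts, so the resulting breaks are illustrative and not those of the full dataset. The function name `natural_breaks_bruteforce` is our own.

```python
from itertools import combinations

def natural_breaks_bruteforce(values, nb_classes):
    """Exhaustively find the split of the sorted values into nb_classes
    contiguous groups with the smallest within-class squared deviation."""
    data = sorted(values)
    n = len(data)

    def squared_deviation(group):
        mean = sum(group) / len(group)
        return sum((v - mean) ** 2 for v in group)

    best_cost, best_groups = None, None
    # Every choice of nb_classes - 1 cut positions defines one candidate split
    for cuts in combinations(range(1, n), nb_classes - 1):
        bounds = (0,) + cuts + (n,)
        groups = [data[a:b] for a, b in zip(bounds, bounds[1:])]
        cost = sum(squared_deviation(g) for g in groups)
        if best_cost is None or cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups

# Unemployment rates of the nine cantons visible in the excerpts (September 2017)
rates = [3.3, 2.4, 1.7, 0.6, 1.7, 2.8, 5.1, 5.2, 4.4]
groups = natural_breaks_bruteforce(rates, 3)

# Breaks in the usual form: the minimum followed by the upper bound of each class
breaks = [groups[0][0]] + [g[-1] for g in groups]
print(groups)
print(breaks)
```

The exhaustive search is only practical for small inputs, but it shows why cantons in the same class end up with similar rates.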

In [58]:
m = folium.Map(
    location= swiss_coord,
    tiles='Mapbox Bright',
    zoom_start=8
)

add_choropleth_layer(m=m,
                     column=('Septembre 2017', 'Taux de chômage'),
                     fill_color='YlOrRd',
                     layer_name='unemployed jobseeker',
                     legend_name='% of unemployed jobseekers out of active population in month september',
                     scale='jenks',
                     nb_classes=3
                    )

add_choropleth_layer(m=m, 
                     column=('Septembre 2017', 'Taux demandeurs d emploi'),
                     fill_color='YlOrRd',
                     layer_name='jobseekers',
                     legend_name='% of jobseekers out of active population in month september',
                     scale='jenks',
                     nb_classes=3
                    )

folium.LayerControl().add_to(m)

m.save('swiss_2_choropleths_jenks_scale.html')

In [59]:
from IPython.display import IFrame
IFrame(src="swiss_2_choropleths_jenks_scale.html",width=900,height=800)

An interesting detail to look at in any choropleth map is which elements sit at the extremes of the legend. 
In the above map, when we also take the employed jobseekers into account, the canton of Geneva is no longer the one with the highest rate; Neuchâtel is, which puts the two cantons in very different perspectives.
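
The change in ranking can be verified from the exported figures for the two cantons, using the conversion formula from the section on computing data for the first map. The dictionary layout below is only an illustrative arrangement of those figures.

```python
# (unemployment rate in %, registered unemployed, all jobseekers), September 2017
cantons = {
    'Genève': (5.2, 12234, 15497),
    'Neuchâtel': (5.1, 4738, 6350),
}

# Jobseekers rate = jobseekers * unemployment rate / registered unemployed
jobseekers_rate = {
    name: seekers * rate / unemployed
    for name, (rate, unemployed, seekers) in cantons.items()
}

highest_unemployment = max(cantons, key=lambda name: cantons[name][0])
highest_jobseekers = max(jobseekers_rate, key=jobseekers_rate.get)

print(highest_unemployment, highest_jobseekers)
print({name: round(value, 1) for name, value in jobseekers_rate.items()})
```

Geneva leads on the published unemployment rate, but Neuchâtel has proportionally more employed jobseekers, which lifts it above Geneva on the jobseekers rate.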

In conclusion, this is a clear example of how a visualization can steer the viewer's perspective towards some desired form by controlling the amount of detail made available.

## Task 3 - Swiss and Foreign workers statistics - discrepant or logical?

---
<div class="alert alert-block alert-info">
<b>Task</b>: 
<ol>
    <li>Use the <a href="https://www.amstat.ch">amstat</a> website again to find a dataset that includes the unemployment rates in Switzerland at recent date, this time making a distinction between <i>Swiss</i> and <i>foreign</i> workers.
    </li>
    <li>
The Economic Secretary (SECO) releases <a href="https://www.seco.admin.ch/seco/fr/home/Arbeit/Arbeitslosenversicherung/arbeitslosenzahlen.html">a monthly report</a> on the state of the employment market. In the latest report (September 2017), it is noted that there is a discrepancy between the unemployment rates for <i>foreign</i> (5.1%) and <i>Swiss</i> (2.2%) workers.
    </li>
    <li>
Show the difference in unemployment rates between the two categories in each canton on a Choropleth map (<i>hint</i> The easy way is to show two separate maps, but can you think of something better ?). Where are the differences most visible ? Why do you think that is ?
    </li>
    <li>Now let's refine the analysis by adding the differences between age groups. As you may have guessed it is nearly impossible to plot so many variables on a map. Make a bar plot, which is a better suited visualization tool for this type of multivariate data.
    </li>
</ol>
</div>

---

### Data retrieval and exploration

One of the main challenges of this task is finding appropriate data and navigating the website. Since we do not speak any of the national languages, we used translation services. This may affect the quality of the data we obtain for several reasons:
* we might not understand the semantics of the data
* we might misinterpret the explanations
* we might miss data that is plainly present

In this part we mostly use the latest data, for September 2017. We could aggregate the data to find mean values, but we think aggregating over several months would not bring new information unless done for a specific purpose, for example to compare mean values across seasons. We load a year's worth of data and try to use most of it in several approaches, such as interactive selection of the month. Unfortunately, in practice this interactivity through widgets proved hard to make portable, as described in detail later.

#### Navigating the Amstat website

For this task we collect unemployment data by nationality, broken down by canton. Raw data is available under *Details*, where the needed data can be found depending on the criteria.

![amstat landing page](images/amstat_landing.png)

As we need the unemployment rate, we navigate to the specific page, where we specify the criteria for selecting the data.

##### Unemployment rate per nationality 

We extracted the unemployment rate *for each canton* with *nationality* as the aggregation criterion, which gives the rates separately for Swiss and foreign nationals. 

We exported the data as an **Excel** spreadsheet and specified a time span of one year: from October 2016 until September 2017. We rely mostly on the data from September 2017, but with well-parametrized visualizations, displaying a different month only requires changing a single function parameter. We opted for an Excel spreadsheet instead of a CSV file because it provides better encoding support. Since Pandas supports both formats, we chose the one with the easier preprocessing.

Data is available in: `data\Unemployment_Rate_Nationality-1year.xlsx`.

##### Unemployment rate per age

Similarly, we have exported the unemployment rate data for each canton based on the *age* criteria instead of nationality. The time span is similarly one year for possible further analytics without any need for data recollection. Age criterion selected aggregates the data based on 3 age intervals:
- 15-24 years
- 25-49 years
- 50+ years

Another option would be to use 5-year intervals, which provide better granularity. For this analysis the 3 general groups are more suitable: they roughly correspond to *young*, *mid-age* and *senior* workers and give more insight into these population groups.

Data is available in: `data\Unemployment_Rate_Age-1year.xlsx`. 

##### Unemployment rate per age and nationality

In the previous two data collection tasks we have collected data that is disjoint in terms of criteria: **either age or nationality**. This is a limitation of the specific platform and page, since we could select only one criterion:

![one criterion](images/amstat_one_criterion.png)

One solution would be to infer such data from the total cantonal population or from other datasets available on *amstat*. We have opted not to infer, since we do not have a complete understanding of the data and the methodologies used. Some descriptions do exist, but meaning is commonly lost when using translation services. Another possible issue would be using a wrong methodology or different data, such as an external source for the cantonal population, which may differ from the figures behind the data provided. **It was imperative to find the data on amstat!**

While navigating the website, we found that a possible misconfiguration of the page or of its security measures allows us to access the [server data](https://www.amstat.ch/MicroStrategy/servlet/mstrWeb). This way we can see both the options present on the website and options that are not present there (or that we did not manage to find)! We present our findings along with screenshots for easier comprehension:

###### Server landing page

We believe that being allowed to see this page (Test and Production server data) might be a misconfiguration of the website. We choose to explore the Production data. Luckily, the data is read-only and mostly similar to the content that is accessible from the website.

![server 0](images/amstat_server.png)

###### Production data

Next we select the general search data.
![server 1](images/amstat_server1.png)

###### Detailed search

We proceed by going into the detailed search to try to find the option for getting the data with more general criteria.
![server 2](images/amstat_server2.png)

###### Monthly data on unemployment

We select the option to see the monthly data reports. There are other options explored, but this has yielded the final result.
![server 3](images/amstat_server3.png)

###### Monthly data on unemployment rate

Luckily, we can find the data on the unemployment rate! Exploring the form further, we see that it differs from the previous form, where we could choose only one criterion. Here we can set up our filter with more detail and finer granularity.

![server 4](images/amstat_server4.png)

###### Profit!

The most important feature here is that we could select multiple criteria! To get the data we select both **age AND nationality**. We finally export the data as an Excel spreadsheet to use in the final part of the analysis.

![server 5](images/amstat_server5.png)

Raw data which aggregates unemployment rates per nationality and age is available at: `data\Unemployment_Rate-Age+Nationality.xlsx`.

---

### Final obtained data - short revision:

We have obtained following data from the *amstat* page:
* `data\Unemployment_Rate_Nationality-1year.xlsx`: unemployment rate, per canton, by nationality in the period Oct 2016-Sep 2017
* `data\Unemployment_Rate_Age-1year.xlsx`: unemployment rate, per canton, by age in the period Oct 2016-Sep 2017
* `data\Unemployment_Rate-Age+Nationality.xlsx`: unemployment rate, per canton, by nationality and age in the period Oct 2016-Sep 2017

Since all of this data comes from a single source and follows the same methodology, we have not introduced any error through manual inference or through calculations based on external data or unverified methodologies.

---

### Visualizing the findings - let's map!

Now that we have obtained the data, an intuitive way to present geographically split data, in this case cantonal data, is to use a map. The audience gets a better understanding of the geographical and spatial distribution of the data, and we can draw further conclusions based on geopolitical factors.

We will go through the visualization process step by step, explain the difficulties we have encountered as well as try to provide context for the actual data and its analysis.

In [60]:
import pandas as pd
import folium
import json
import seaborn as sns
import branca
import bokeh
import datetime

from bokeh.embed import file_html
from bokeh.resources import CDN
from bokeh.charts import Donut, show, output_file, Scatter, Bar
from bokeh.sampledata.olympics2014 import data
from IPython.display import IFrame, HTML

import vincent
vincent.core.initialize_notebook()

import matplotlib.pyplot as pl
%matplotlib inline

The bokeh.charts API has moved to a separate 'bkcharts' package.

This compatibility shim will remain until Bokeh 1.0 is released.
After that, if you want to use this API you will have to install
the bkcharts package explicitly.

  warn(message)


We will use *Pandas* for data manipulation; *Folium* and *Branca* for creating maps; *Seaborn*, *Bokeh*, *Vincent* and *Matplotlib* for making graphs. 

**To preserve compatibility with different versions, especially with folium, we use bokeh.charts as is, without additional installations, hence the warning message.**

We define the paths to the previously obtained files.

In [61]:
UNEMPLOYMENT_RATE_BY_NATIONALITY = 'data/Unemployment_Rate_Nationality-1year.xlsx'
UNEMPLOYMENT_RATE_BY_AGE = 'data/Unemployment_Rate_Age-1year.xlsx'
UNEMPLOYMENT_RATE_COMBINED = 'data/Unemployment_Rate-Age+Nationality.xlsx'

### Data processing

After obtaining the necessary files from *amstat*, we need to process them into a usable `DataFrame`. We write a generic function for processing the data: in simple terms, we always need to drop certain columns, rename some columns to more suitable names, or set a new index.

In [62]:
def get_dataframe_rate(path, cols_to_drop, rename_pair, new_index, drop_last=True):
    '''
    Load and clean the .xlsx data on unemployment rates in a generalized manner.
    Returns the pruned dataframe, ready for use and analysis.

    Parameters:
    path -- path to the excel file containing the necessary data
    cols_to_drop -- list of column names to drop
    rename_pair -- dictionary mapping old column names to new column names
    new_index -- list of column names for the new index (currently unused, see the commented line below)
    drop_last -- boolean, whether to drop the last row, which is usually a 'Total' value and not a cantonal value

    Returns:
    df -- processed dataframe
    '''
    df = pd.read_excel(path, convert_float=False)
    df.drop(cols_to_drop, axis=1, inplace=True) # drop the unnecessary data
    if drop_last: # drop the last row, usually representing the total
        df.drop([0,len(df)-1], axis=0, inplace=True) # drop header text and total values, we need cantonal values only
    else:
        df.drop([0], axis=0, inplace=True) # no total values present, we drop only header text
    df.rename(columns=rename_pair, inplace=True)
    #df.set_index(new_index, inplace=True) # We opted not to set the multiindex, since for the map we would need to reset it
    
    return df

Using the previously defined generic function, we instantiate functions for processing each file to a `DataFrame` in a proper manner.

In [63]:
def get_dataframe_rate_nationality(path=UNEMPLOYMENT_RATE_BY_NATIONALITY):
    '''Shortcut to get the pruned dataframe for the unemployment rate by nationality.'''
    return get_dataframe_rate(path, ['Mois'], {"Nationalité":'Nationality'},['Canton', 'Nationality'])

def get_dataframe_rate_age(path=UNEMPLOYMENT_RATE_BY_AGE):
    '''Shortcut to get the pruned dataframe for the unemployment rate by age.'''
    return get_dataframe_rate(path, ['Mois', 'Unnamed: 2'], {"Classes d'âge 15-24, 25-49, 50 ans et plus":'Age category'}, ['Canton', 'Age category'])

def get_dataframe_rate_combined(path=UNEMPLOYMENT_RATE_COMBINED):
    '''Shortcut to get the pruned dataframe for the unemployment rate by age and nationality combined.'''
    return get_dataframe_rate(path, ['Unnamed: 3', 'Monat'], {"Altersklassen 15-24, 25-49, 50 und mehr":'Age category', 
                                                              "Nationalität":'Nationality', "Kanton":'Canton'},
                                       ['Canton', 'Nationality', 'Age category'], drop_last=False)

Finally we instantiate the necessary `DataFrames`:
* u_rate_nationality - unemployment rate per nationality (foreign or Swiss)
* u_rate_age - unemployment rate per age (15-24, 25-49, 50+)
* u_rate_combined - unemployment rate with both nationality and age distribution per such group

In [64]:
u_rate_nationality = get_dataframe_rate_nationality()
u_rate_age = get_dataframe_rate_age()
u_rate_combined = get_dataframe_rate_combined()

As we have obtained all the necessary data, we are ready to explore the geographical data through visualization on maps!

## Let us explore through visualization

Since the data is distributed geographically over the cantons, the best way to visualize it is to place it on a map. The position of each canton and geo(political) factors, such as neighboring countries or certain geographical features, may prove valuable in the analysis.

#### GeoJSON vs TopoJSON
We are provided with TopoJSON files. TopoJSON is an extension of GeoJSON. In GeoJSON, each object to be drawn on the map (e.g. a line or a polygon) lists its own coordinates separately. TopoJSON follows the same paradigm but adds a compression strategy: instead of repeating the coordinates for each object, the file stores a shared collection of line segments (*arcs*), and each object only references the arcs it is built from. This way borders shared by neighboring objects are stored once, which reduces the total size compared to a GeoJSON file.

Nevertheless, *Folium* uses GeoJSON as a primary source for displaying data and certain preprocessing steps are necessary to achieve the same result!
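
To make the idea of shared arcs more concrete, here is a minimal sketch of how a polygon ring can be rebuilt from a topology. It follows the conventions of the TopoJSON specification (delta-encoded arc positions when a `transform` is present, negative indexes for arcs traversed in reverse). The toy topology and the helper names `decode_arc` and `stitch_ring` are our own illustration; they are not part of Folium or of the provided files.

```python
def decode_arc(arc, transform=None):
    """Turn one TopoJSON arc into a list of absolute [x, y] positions."""
    if transform is None:
        # Without a transform the positions are already absolute.
        return [list(point) for point in arc]
    (sx, sy), (tx, ty) = transform["scale"], transform["translate"]
    x = y = 0
    points = []
    for dx, dy in arc:
        # Quantized arcs store the first point as-is and the rest as deltas.
        x += dx
        y += dy
        points.append([x * sx + tx, y * sy + ty])
    return points


def stitch_ring(topology, arc_indexes):
    """Join the arcs referenced by one polygon ring into a coordinate list."""
    transform = topology.get("transform")
    ring = []
    for index in arc_indexes:
        # A negative index ~i means "arc i, traversed in reverse".
        arc = topology["arcs"][~index if index < 0 else index]
        points = decode_arc(arc, transform)
        if index < 0:
            points.reverse()
        # Consecutive arcs share their junction point, so skip the duplicate.
        ring.extend(points[1:] if ring else points)
    return ring


# Hypothetical toy topology: two arcs that together form a unit square.
toy = {
    "type": "Topology",
    "transform": {"scale": [1, 1], "translate": [0, 0]},
    "arcs": [
        [[0, 0], [1, 0], [0, 1]],  # (0,0) -> (1,0) -> (1,1)
        [[0, 0], [0, 1], [1, 0]],  # (0,0) -> (0,1) -> (1,1)
    ],
    "objects": {},
}

square = stitch_ring(toy, [0, ~1])
print(square)
```

The second arc is referenced as `~1`, so it is walked backwards and closes the ring at its starting point. Two neighboring polygons could reference that same arc in opposite directions, which is where the size reduction comes from.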

We load the Swiss TopoJSON object, representing the outline of each canton:

In [65]:
with open('topojson/ch-cantons.topojson.json') as topo_file: # close the file once it is parsed
    ch_topo = json.load(topo_file)

#### Unifying TopoJSON and statistical data identifier

Before we show data on the map, we need to have corresponding identifiers both in statistical data and in the TopoJSON data. One logical option was to use the canton abbreviations. Since in our data both German and French names occur, depending on the downloaded data (the file combining nationality and age was available only in German), we need to establish a mapping between such names and the canton code.

First, we extract the canton code (`id`) from the TopoJSON file:

In [66]:
canton_id = [canton['id'] for canton in ch_topo['objects']['cantons']['geometries']]

Next, we build a dictionary from the obtained canton codes and the full canton names present in the statistical data. Note that this pairing relies on the cantons appearing in the same order in the statistical data and in the TopoJSON file, which is the case here.

In [67]:
canton_id_name_fr = zip(canton_id, u_rate_nationality.reset_index()['Canton'].drop_duplicates())
canton_id_name_de = zip(canton_id, u_rate_combined.reset_index()['Canton'].drop_duplicates())

Finally we establish a dictionary suitable for replacing the values in column named *Canton* in the `DataFrame` with canton codes. We will use this dictionary in the `replace` function of each `DataFrame`.

In [68]:
cantons_pairs_fr = {'Canton':{pair[1]: pair[0] for pair in canton_id_name_fr}}
cantons_pairs_de = {'Canton':{pair[1]: pair[0] for pair in canton_id_name_de}}

For a sanity check, we manually inspect one such dictionary to see if data pairings are correct:

In [69]:
cantons_pairs_de

{'Canton': {'Aargau': 'AG',
  'Appenzell Ausserrhoden': 'AR',
  'Appenzell Innerrhoden': 'AI',
  'Basel-Landschaft': 'BL',
  'Basel-Stadt': 'BS',
  'Bern': 'BE',
  'Freiburg': 'FR',
  'Genf': 'GE',
  'Glarus': 'GL',
  'Graubünden': 'GR',
  'Jura': 'JU',
  'Luzern': 'LU',
  'Neuenburg': 'NE',
  'Nidwalden': 'NW',
  'Obwalden': 'OW',
  'Schaffhausen': 'SH',
  'Schwyz': 'SZ',
  'Solothurn': 'SO',
  'St. Gallen': 'SG',
  'Tessin': 'TI',
  'Thurgau': 'TG',
  'Uri': 'UR',
  'Waadt': 'VD',
  'Wallis': 'VS',
  'Zug': 'ZG',
  'Zürich': 'ZH'}}
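
The manual inspection could be complemented by a small automated check. The sketch below is only an illustration: `build_canton_mapping` is a hypothetical helper, and the three-canton lists are a made-up excerpt. It catches length and uniqueness problems, but it cannot detect a wrong ordering, so looking at the pairs is still necessary.

```python
def build_canton_mapping(canton_ids, canton_names):
    """Pair names with ids by position and fail loudly on obvious mismatches."""
    ids = list(canton_ids)
    names = list(canton_names)
    if len(ids) != len(names):
        raise ValueError(f"{len(ids)} ids but {len(names)} names")
    if len(set(ids)) != len(ids) or len(set(names)) != len(names):
        raise ValueError("ids and names must each be unique")
    # Positional pairing: correct only if both lists share the same order.
    return dict(zip(names, ids))


# Hypothetical three-canton excerpt, in matching order.
mapping = build_canton_mapping(["ZH", "BE", "LU"], ["Zürich", "Bern", "Luzern"])
print(mapping)
```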

We define a helper function to replace the full canton names with their code:

In [70]:
def replace_canton_with_id(dataframe, canton_pairs):
    replaced_df = dataframe.reset_index().replace(to_replace=canton_pairs)
    return replaced_df

By using the previously defined dictionary and function, we perform the final modification of the data and replace the mentioned values.

In [71]:
u_rate_age = replace_canton_with_id(u_rate_age, cantons_pairs_fr)
u_rate_nationality = replace_canton_with_id(u_rate_nationality, cantons_pairs_fr)
u_rate_combined = replace_canton_with_id(u_rate_combined, cantons_pairs_de)

### Let's map!

As mentioned several times before, we use the *Folium* package, version *0.5.0*. Folium is a Python wrapper around the very popular `JavaScript` library *Leaflet* for maps. Unfortunately, compatibility varies between versions and the documentation is sometimes misleading, so additional care had to be taken when implementing the more exotic aspects of the visualization, as will be shown later on.

To start, we show a map of Switzerland (*m_switzerland*). We set the zoom level to an appropriate value and test the longitude and latitude values so that Switzerland fits nicely in the window. We decided on the `Mapbox Bright` tiles since they leave only the most important geographical features visible, which suits us because the **choropleth** overlay will be the main feature later on.

In [72]:
m_switzerland = folium.Map([46.8,8.3], tiles='Mapbox Bright', zoom_start=8)
m_switzerland

### Displaying data on Swiss and foreigners unemployment rate

As mentioned, we could use two separate maps, but it would be more difficult to notice some differences with respect to the geographical distribution. We will use two separate layers, one for Swiss and one for foreign citizens and the respective unemployment rate. 

Since the scale of the data differs, we have chosen a different color scheme for each layer to make the difference more intuitive. The opacity is set to 0.5 for both layers because the color schemes differ. Had we used the same fill color for both, we would have needed to adjust `fill_opacity` manually to suit the scale, so that the fill remains readable when both layers are displayed.
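
An alternative to separate scales would be to give both layers the same class boundaries, so that a given shade means the same rate in either layer. The sketch below computes equal-interval breaks over both series; `shared_breaks` is a hypothetical helper and the rates are made-up values, not our data. We assume, without having verified it for Folium 0.5.0, that such a list could be handed to the `threshold_scale` argument of `choropleth`.

```python
def shared_breaks(*series, classes=5):
    """Equal-interval class boundaries covering all given series."""
    values = [value for one_series in series for value in one_series]
    low, high = min(values), max(values)
    step = (high - low) / classes
    # classes + 1 boundaries delimit `classes` intervals.
    return [round(low + i * step, 2) for i in range(classes + 1)]


# Hypothetical rates (in %), chosen only for illustration.
swiss_rates = [1.0, 1.8, 2.6]
foreign_rates = [3.0, 4.4, 6.0]

breaks = shared_breaks(swiss_rates, foreign_rates, classes=5)
print(breaks)
```

With a shared scale the Swiss layer would mostly fall into the lowest classes, which makes the gap between the two groups visible at a glance but hides the variation between cantons within one group. Which of the two matters more is a design choice.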

In [73]:
# we add a layer for the Swiss unemployment rate
m_switzerland.choropleth(geo_data=ch_topo,
                         name="Suisses", 
                         topojson='objects.cantons',
                         data = u_rate_nationality[u_rate_nationality.Nationality=='Suisses'],
                         columns = ['Canton','Septembre 2017'],
                         key_on='feature.id',
                         fill_color='YlGnBu', 
                         fill_opacity=0.5, 
                         line_opacity=0.4,
                         legend_name="Unemployment Rate (%) - Swiss")

# we add a layer for the foreigners unemployment rate
m_switzerland.choropleth(geo_data=ch_topo,
                         name="Etrangers", 
                         topojson='objects.cantons',
                         data = u_rate_nationality[u_rate_nationality.Nationality=='Etrangers'],
                         columns = ['Canton','Septembre 2017'],
                         key_on='feature.id',
                         fill_color='YlOrRd', 
                         fill_opacity=0.5, 
                         line_opacity=0.4,
                         legend_name="Unemployment Rate (%) - Foreigners")

# we add layer control to be able to control the display of the layers
folium.LayerControl().add_to(m_switzerland)
m_switzerland

#folium.Map.save(m_switzerland, "map-unemployment.html") # Optionally save the map for later display as html

We can notice a trend of higher unemployment among both foreign and Swiss nationals in the western parts of the country, bordering France. The rates are significantly lower in the eastern parts of Switzerland, especially in the ones bordering both Italy and Austria.

Observed separately, the foreign and Swiss unemployment rates follow a similar trend across cantons, which is also visible when both layers are displayed.

This raises the question of why unemployment is higher among foreigners. Some insight can be gained from the way the percentage is calculated.

>The unemployment rate is defined as the number of *unemployed, registered jobseekers* divided by the *total active population*.

When this rate is split by foreign and Swiss nationals, we see that *unemployed, registered jobseekers* are relatively more numerous among foreigners. Note, however, that the percentage is calculated with respect to the active population of foreigners only. Therefore, with a smaller population and comparatively many unemployed foreigners, the rate will be higher. Considering the employment policy of Switzerland, where Swiss workers are more protected and preferred over foreign workers, this may be a result of such policy. There might be other internal policies or explanations we are not aware of, since we are not Swiss.
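
The effect of the denominator can be illustrated with a few lines of arithmetic. The counts below are hypothetical and were chosen only so that the results match the national rates quoted in the assignment (2.2% and 5.1%); they are not taken from the amstat data.

```python
def unemployment_rate(jobseekers, active_population):
    """Unemployed, registered jobseekers per 100 people of the active population."""
    return 100 * jobseekers / active_population


# Hypothetical counts: fewer unemployed foreigners in absolute terms,
# but measured against a much smaller active population.
swiss_rate = unemployment_rate(jobseekers=2_200, active_population=100_000)
foreign_rate = unemployment_rate(jobseekers=1_530, active_population=30_000)

print(round(swiss_rate, 1), round(foreign_rate, 1))
```

In this made-up example the foreign group has fewer unemployed people than the Swiss group, yet its rate is more than twice as high, because each rate is relative to the size of its own group.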

This conclusion is consistent with the correlation we see between the unemployment rates of the two nationality groups across cantons.

#### [Failed] Attempt on interactive overview of data

Since we have displayed the data only for September 2017, we would also like to display the monthly data for other dates. *Folium* has limited support for sliders, mainly reserved for *Heatmaps*, so we have tried implementing a slider widget.

In [74]:
import ipywidgets as widgets

In [75]:
# We add a month mapping between english abbreviations and month names in french
english2french_month = {
    'Oct':'Octobre',
    'Nov':'Novembre',
    'Dec':'Décembre',
    'Jan':'Janvier',
    'Feb':'Février',
    'Mar':'Mars',
    'Apr':'Avril',
    'May':'Mai',
    'Jun':'Juin',
    'Jul':'Juillet',
    'Aug':'Août',
    'Sep':'Septembre'}
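
One caveat with a lookup keyed on English abbreviations is that `strftime('%b')` depends on the locale of the machine running the notebook. A locale-independent variant could index the French names by month number instead. This is only a sketch: `FRENCH_MONTHS` and `french_column_label` are hypothetical names, not used elsewhere in the notebook.

```python
import datetime

# French month names indexed by month number - 1.
FRENCH_MONTHS = ['Janvier', 'Février', 'Mars', 'Avril', 'Mai', 'Juin',
                 'Juillet', 'Août', 'Septembre', 'Octobre', 'Novembre', 'Décembre']


def french_column_label(date):
    """Build a column label such as 'Septembre 2017' from a date object."""
    return f"{FRENCH_MONTHS[date.month - 1]} {date.year}"


print(french_column_label(datetime.date(2017, 9, 1)))
```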

In [76]:
def show_data(date):
    '''
    Show the choropleth map of the foreign and Swiss unemployment rates for the given month and year.

    date -- datetime.date of the desired month (labelled in '%b-%Y' format in the widget, e.g. Sep-2017)
    '''
    # we parse the given date
    month = date.strftime('%b')
    year = date.strftime('%Y')
    # and convert the date for display
    str_date = english2french_month[month]+' '+year
    
    # we construct a new map
    m_switzerland = folium.Map([46.8,8.3], tiles='Mapbox Bright', zoom_start=8)
    
    # we add the layers representing the Swiss and foreign unemployment rates
    m_switzerland.choropleth(geo_data=ch_topo, 
                         name="Suisses", 
                         topojson='objects.cantons',
                         data = u_rate_nationality[u_rate_nationality.Nationality=='Suisses'],
                         columns = ['Canton',str_date],
                         key_on='feature.id',
                         fill_color='YlGnBu', 
                         fill_opacity=0.5, 
                         line_opacity=0.2,
                         legend_name="Unemployment Rate (%) - Swiss"
                        )

    m_switzerland.choropleth(geo_data=ch_topo, 
                         name="Etrangers", 
                         topojson='objects.cantons',
                         data = u_rate_nationality[u_rate_nationality.Nationality=='Etrangers'],
                         columns = ['Canton',str_date],
                         key_on='feature.id',
                         fill_color='YlOrRd', 
                         fill_opacity=0.5, 
                         line_opacity=0.2,
                         legend_name="Unemployment Rate (%) - Foreign"
                        )


    folium.LayerControl().add_to(m_switzerland)
    
    # display the content on function call
    display(HTML('<h3>'+str_date+'</h3>'))
    display(m_switzerland)

In [77]:
# we define a wrapper for calling the show_data function from the widget
def f(x):
    show_data(x)

In [78]:
# we define a list of dates for which we have data (October 2016 - September 2017)
dates = [datetime.date(2016,i,1) for i in range(10,13)]
dates += [datetime.date(2017,i,1) for i in range(1,10)]

In [79]:
# we convert the possible dates to string format suitable for the widget
options = [(i.strftime('%b-%Y'), i) for i in dates]

We opted to try using the selection slider to select the desired month.

In [80]:
w = widgets.SelectionSlider(
    options=options,
    description='Select month',
    disabled = False,
    readout = True,
    continuous_update=False
)

In [81]:
widgets.interact(f, x=w)

<function __main__.f>

**Unfortunately, interactive widgets do not display well on GitHub or nbviewer. A screenshot is provided to convey the idea behind the functionality when the notebook is running locally:** 

![interactive](images/interactive.png)

---

This way we would be able to browse through historical data to try to see different trends.


### Adding age data to the map

As mentioned earlier, we have managed to find a dataset containing the unemployment rate by nationality, additionally split across 3 age groups. Since this is simply too much data to show directly on a map, we will show the cantonal data in a popup.

Making a separate popup for each canton is not too difficult when using GeoJSON. For TopoJSON we need to devise a method to extract a single object (canton) manually. The TopoJSON [documentation](https://github.com/topojson/topojson/wiki/Introduction) has proven invaluable at this point.

In [82]:
import copy

# a function to extract TopoJSON feature for each canton. Returns a list of TopoJSON objects.
def create_canton_topos(originalTopoJSON):
    canton = []
    for geometry in originalTopoJSON["objects"]["cantons"]["geometries"]:
        tmp_topo = copy.deepcopy(originalTopoJSON)
        tmp_topo["objects"]["cantons"]["geometries"]=[geometry]
        
        canton.append(tmp_topo)
        
        
    return canton

We split the TopoJSON into 26 TopoJSON descriptors, one for each canton:

In [83]:
canton_topos = create_canton_topos(ch_topo)
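
A lighter variant of the same split could avoid deep-copying the whole topology for every canton by sharing the `arcs` list between the pieces. The sketch below is an assumption-laden illustration: `split_topology` and the toy topology are hypothetical, and sharing is only safe as long as nobody mutates the arcs afterwards. It also only saves memory inside Python, since each piece still carries all arcs once it is serialized into the map's HTML.

```python
def split_topology(topology, object_name="cantons"):
    """Return one single-geometry topology per geometry of `object_name`."""
    pieces = []
    for geometry in topology["objects"][object_name]["geometries"]:
        # Shallow copy: the (large) 'arcs' list is shared, not duplicated.
        piece = dict(topology)
        objects = dict(topology["objects"])
        objects[object_name] = dict(topology["objects"][object_name],
                                    geometries=[geometry])
        piece["objects"] = objects
        pieces.append(piece)
    return pieces


# Hypothetical toy topology with two "cantons" referencing one arc.
toy_topology = {
    "type": "Topology",
    "arcs": [[[0, 0], [1, 1]]],
    "objects": {
        "cantons": {
            "type": "GeometryCollection",
            "geometries": [
                {"type": "LineString", "id": "ZH", "arcs": [0]},
                {"type": "LineString", "id": "BE", "arcs": [0]},
            ],
        }
    },
}

pieces = split_topology(toy_topology)
print([p["objects"]["cantons"]["geometries"][0]["id"] for p in pieces])
```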

We define generic functions for creating `Vincent Vega` pie charts of *Age* and *Nationality*. Unfortunately, these functions were not used in the final run; a **grouped bar chart** has been implemented instead. The implementations stay, since it would be easy to change the visualization by simply invoking one of the functions with the desired canton and date.

In [84]:
# returns a dictionary object for displaying on Folium map - a pie chart of unemployment rate by age
def make_pie_age(canton, date):
    d = {}
    cnt = 0
    keys = ['15-24','25-49','50+']
    
    for el in u_rate_age[u_rate_age.Canton==canton][date]:
        d[keys[cnt]] = el
        cnt += 1
    
    pie = vincent.Pie(d, width=100, height=100)
    pie.legend('Age range')
    pie_json = pie.to_json()
    pie_dict = json.loads(pie_json)
    
    return pie_dict

# returns a dictionary object for displaying on Folium map - a pie chart of unemployment rate by nationality
def make_pie_nationality(canton, date):
    d = {}
    cnt = 0
    keys = ['Foreign','Swiss']
    
    for el in u_rate_nationality[u_rate_nationality.Canton==canton][date]:
        d[keys[cnt]] = el
        cnt += 1
        
    pie = vincent.Pie(d, width=100, height=100)
    pie.legend('Nationality')
    pie_json = pie.to_json()
    pie_dict = json.loads(pie_json)
    
    return pie_dict

Since there was no easy way to display multiple `Vincent Vega` graphs in a `Popup`, we decided to use a single **grouped barchart** to aggregate the data. The function is generic in the sense that it generates a graph for any selected canton and date.

In [85]:
# A function which returns a Vincent object for displaying a grouped barchart, data is selected based on canton and date
def make_grouped_bar_combined(canton, date):
    foreigners = {}
    swiss = {}
        
    cnt = 0
    keys = ['15-24','25-49','50+']
    
    df_canton = u_rate_combined[u_rate_combined.Canton==canton].replace('...',-1)
    df_foreign = df_canton[df_canton.Nationality=='Ausländer']
    df_swiss = df_canton[df_canton.Nationality=='Schweizer']
  
    for el in df_foreign[date]:
        if(el!=-1):
            foreigners[keys[cnt]] = el
        cnt += 1
        
    cnt = 0
    for el in df_swiss[date]:
        if(el!=-1):
            swiss[keys[cnt]] = el
        cnt += 1
        
    data = [foreigners, swiss]
    index = ['Foreigners', 'Swiss']
    
    bar = vincent.GroupedBar(pd.DataFrame(data, index=index))
    bar.legend(title='Unemployment rate by nationality and age')
    bar.axis_titles(x='Nationality', y='Unemployment rate')
    bar.common_axis_properties(title_size=10)
    bar.width = 250
    bar.height = 200
    
    bar_json = bar.to_json()
    bar_dict = json.loads(bar_json)
    #bar.display()
    return bar_dict

#### Displaying a map with popup data

We now generate and display the map, with more detailed insights into the age distribution of the unemployment rate per nationality visible in a popup. Clicking on a canton shows its data for September 2017. Since the generated map is too big for the notebook to render as an object, we save the map to an HTML file and then load it for display.

In [86]:
# we create an empty map
m_switzerland = folium.Map([46.8,8.3], tiles='Mapbox Bright', zoom_start=8)

# we iterate through every separate cantonal TopoJSON
for canton in canton_topos:
    
    name = canton["objects"]["cantons"]["geometries"][0]['id']
    
    tj = folium.TopoJson(canton, 
               'objects.cantons',
               name=name)
    
    # we create a grouped barchart for the current canton
    v_bar_combined = folium.Vega(make_grouped_bar_combined(name,'September 2017'), width=500, height=250)
    
    # we create and add a popup for each canton
    popup = folium.Popup(max_width=500)
    popup.add_child(v_bar_combined)
    
    tj.add_child(popup)
    tj.add_to(m_switzerland)

In [87]:
# we save the map to an html file
m_switzerland.save("map-vincent.html")

We load the map as an IFrame: an exported map is a standalone HTML page, and a browser cannot display one HTML page inside another otherwise. Do click on the cantons to see the detailed distribution of age per canton!

#### Interactive map - click on the canton to see the age distribution and nationality 

When data is not available for a certain age group, the bar is shown as 0. This is a property of the data that we cannot change. By using popups we have managed to preserve the valuable geographical representation, while adding extra information without overwhelming the user.
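
The combined export marks missing values with the string `'...'`, which the cells in this notebook replace with -1 or 0 before plotting. An alternative would be to keep missing values explicitly missing, so that they cannot be mistaken for a true rate of 0%. A minimal sketch, where `parse_rate` is a hypothetical helper and not part of the notebook:

```python
import math


def parse_rate(value, placeholder="..."):
    """Return the rate as a float, or NaN when the source has no value."""
    if isinstance(value, str) and value.strip() == placeholder:
        return math.nan
    return float(value)


# Hypothetical row: the middle value is missing in the source.
row = [3.4, "...", "2.5"]
rates = [parse_rate(value) for value in row]
print(rates)
```

Whether the plotting library then skips the bar or draws an empty slot depends on the library, so this would need to be checked for the chart type in use.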

In [88]:
IFrame(src="map-vincent.html",width=900,height=800)

### We are not satisfied yet!

We would still like to be able to see 3 parameters in the popup:
* total age ratios as a piechart
* total nationality ratio as a piechart
* unemployment rate grouped by age and nationality as a barchart

We had to go deeper into the darker realms of coding to manage this. We use Bokeh to create the graphs, merge all of them into one HTML snippet and show them in a single popup.
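
The general idea of merging several standalone HTML documents into one snippet can be sketched with the standard library alone. The example below is not the approach used in the following cells: it wraps every document in its own `<iframe>` through the HTML5 `srcdoc` attribute, so that each chart keeps its own page context. The helper names `iframe_srcdoc` and `combine_frames`, and the two dummy documents, are hypothetical.

```python
import html


def iframe_srcdoc(document, width, height):
    """Embed a complete HTML document in an iframe via the srcdoc attribute."""
    # The document must be escaped so that its quotes and tags
    # do not terminate the attribute value.
    escaped = html.escape(document, quote=True)
    return (f'<iframe srcdoc="{escaped}" width="{width}" height="{height}" '
            'style="border:none;"></iframe>')


def combine_frames(title, frames):
    """Put a heading in front of the already wrapped frames."""
    heading = ('<p style="font-family: Verdana; text-align: center;">'
               f'{html.escape(title)}</p>')
    return heading + ''.join(frames)


# Dummy stand-ins for the HTML documents produced by a charting library.
charts = [
    '<!DOCTYPE html><html><body><p>chart "A"</p></body></html>',
    '<!DOCTYPE html><html><body><p>chart "B"</p></body></html>',
]

popup_html = combine_frames('Statistics for canton ZH',
                            [iframe_srcdoc(chart, 220, 220) for chart in charts])
print(popup_html[:80])
```

The resulting string is a plain HTML fragment, so it could in principle be passed to whatever popup mechanism accepts HTML; we have not tested it with Folium popups.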

In [89]:
# we define a generic function for creating a bokeh piechart for unemployment rate by nationality.
# canton and date are passed as parameters; a bokeh Donut (piechart) is returned
def bokeh_pie_nationality(canton, date):
    df_canton = u_rate_nationality[u_rate_nationality.Canton==canton]
    
    d = Donut(df_canton, values=date, label=['Nationality'], text_font_size='12pt', hover_text='Unemployment by nationality',
             height = 220, width=220)
    
    d.toolbar.disabled = True
    d.toolbar.logo = None
    d.toolbar_location = None
    d.title.text = "Unemployment (%) by nationality"
    
    return d

In [90]:
# we define a generic function for creating a bokeh piechart for unemployment rate by age.
# canton and date are passed as parameters; a bokeh Donut (piechart) is returned
def bokeh_pie_age(canton, date):
    df_canton = u_rate_age[u_rate_age.Canton==canton]
    
    d = Donut(df_canton.replace({'Age category': {1.0: '15-24', 2.0: '25-49', 3.0:'50+'}}), values=date, label=['Age category'], text_font_size='12pt', hover_text='Unemployment by age',
             height = 220, width=220)
    
    d.toolbar.disabled = True
    d.toolbar.logo = None
    d.toolbar_location = None
    d.title.text = "Unemployment (%) by age"
    
    return d

In [91]:
# we define a generic function for creating a bokeh grouped barchart for unemployment rate by age and nationality.
# canton and date are passed as parameters; a bokeh Bar (barchart) is returned
def bokeh_bar_combined(canton, date):
    df_canton = u_rate_combined[u_rate_combined.Canton==canton].replace('...',0)
    
    d = bokeh.charts.Bar(df_canton.replace({'Age category': {1.0: '15-24', 2.0: '25-49', 3.0:'50+'}}), values=date,
                         label=['Age category'], group=['Nationality'], legend='top_right')
    
    d.toolbar.disabled = True
    d.toolbar.logo = None
    d.toolbar_location = None
    d.title.text = "Unemployment % by age category, nationality"
    d.yaxis.axis_label = "Unemployment rate [%]"
    d.axis.axis_label_text_font_size = '12pt'
    d.title.align = 'center'
    d.title.text_font_size = '12pt'
    d.xaxis.major_label_text_font_size = '12pt'
    d.height = 350
    d.width = 440
    
    return d
    

### The magic starts here!

Since the documentation and the compatibility are very limited for this specific use case, a lot of trial and error was needed. Finally, we have managed to create the desired map!

In [92]:
# we create an empty map
m_switzerland = folium.Map([46.8,8.3], tiles='Mapbox Bright', zoom_start=8)

cnt = 0

# we iterate through each canton TopoJSON object to add a popup
for canton in canton_topos:
    
    cnt += 1
    
    name = canton["objects"]["cantons"]["geometries"][0]['id']
    
    tj = folium.TopoJson(canton, 
               'objects.cantons',
               name=name)
    
    # we generate a bokeh piechart for age, and then extract the HTML data
    v_pie_age = bokeh_pie_age(name, 'Septembre 2017')
    html_age = file_html(v_pie_age, CDN, 'age'+name+str(cnt))
    
    # we generate a bokeh piechart for nationality, and then extract the HTML data
    v_pie_nationality = bokeh_pie_nationality(name, 'Septembre 2017')
    html_nationality = file_html(v_pie_nationality, CDN, 'nationality'+name+str(cnt))
    
    # we generate a bokeh barchart for combined criteria, and then extract the HTML data
    v_bar_grouped = bokeh_bar_combined(name, 'September 2017')
    html_grouped = file_html(v_bar_grouped, CDN, 'grouped'+name+str(cnt))
    
    # we make an IFrame of every such generated HTML, for future integration in the popup
    age = branca.element.IFrame(html=html_age, width=100, height=100)
    nationality = branca.element.IFrame(html=html_nationality, width=100, height=100)
    grouped = branca.element.IFrame(html=html_grouped, width=200, height=100)
    
    # finally, we make an HTML snippet which combines all the previously generated HTML (IFrame) files
    combined_html = '<p style="font-family: Verdana; text-align: center;"> Statistics for canton '+name+'</p>'\
    +'<figure style="width:90; max-width:90; max-height:90; float:left;">'+html_age+'</figure>'\
    +'<figure style="width:90; max-width:90; max-height:90; float:right;">'+html_nationality+'</figure>'\
    +'<figure style="width:95; max-width:95; max-height:95; float:left; padding-top:100;">'+html_grouped+'</figure>'
    
    # we take such HTML and integrate it again as an IFrame to put in the popup
    combined = branca.element.IFrame(html=combined_html, width=450, height=380)
    
    # we create a popup with desired HTML elements
    popup = folium.Popup(combined, max_width=450)
    
    tj.add_child(popup)
    tj.add_to(m_switzerland)

m_switzerland.save("map-full.html")

E-1010 (CDSVIEW_SOURCE_DOESNT_MATCH): CDSView used by Glyph renderer must have a source that matches the Glyph renderer's data source: GlyphRenderer(id='1d76fc03-ce8b-4a01-a087-7c8088708cc6', ...)

The generated map object is too large to render inline because of the additional HTML embedded in every popup, so we save it to disk first and then load it into the notebook through an IFrame. Otherwise Jupyter has trouble rendering the Python object.
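To get a feel for how heavy the saved file is, its size on disk can be checked before embedding it. This is a minimal sketch, assuming the map was written to `map-full.html` as above; the helper name `file_size_mb` is ours and not part of folium.

```python
import os


def file_size_mb(path):
    """Return the size of the file at `path` in megabytes (1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024 * 1024)


# Hypothetical usage: only report the size if the map has already been saved.
if os.path.exists("map-full.html"):
    print("map-full.html: {:.1f} MB".format(file_size_mb("map-full.html")))
```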

**Click on a canton to see the more detailed data**

In [93]:
IFrame(src="map-full.html",width=900, height=800)

Getting this kind of popup to work took quite some effort, but it presents the most relevant information in a concise and intuitive manner!
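The popup markup assembled in the loop above can also be factored into a small helper, which makes the layout easier to tweak. The following is only a sketch under assumptions: `build_popup_html` and `_figure` are hypothetical names, the chart arguments stand for the HTML strings returned by `file_html`, and the lengths are written with explicit `px` units because unitless values such as `width:90` may be ignored by browsers.

```python
from html import escape


def _figure(chart_html, size_px, side, extra_style=""):
    """Wrap one chart's HTML in a floated <figure> of the given size."""
    # Explicit px units are an assumption; the sizes in the loop are unitless.
    style = "width:{0}px; max-width:{0}px; max-height:{0}px; float:{1};{2}".format(
        size_px, side, extra_style
    )
    return '<figure style="{}">{}</figure>'.format(style, chart_html)


def build_popup_html(name, html_age, html_nationality, html_grouped):
    """Combine three pre-rendered chart HTML strings into one popup snippet.

    Mirrors the layout used in the loop: two pie charts side by side,
    with the grouped bar chart pushed below them.
    """
    title = (
        '<p style="font-family: Verdana; text-align: center;">'
        " Statistics for canton " + escape(name) + "</p>"
    )
    return (
        title
        + _figure(html_age, 90, "left")
        + _figure(html_nationality, 90, "right")
        + _figure(html_grouped, 95, "left", " padding-top:100px;")
    )


# Placeholder strings stand in for the bokeh output of file_html.
snippet = build_popup_html("Vaud", "<div>age</div>", "<div>nat</div>", "<div>bar</div>")
```

The returned string could then be passed to `branca.element.IFrame(html=...)` in the same way `combined_html` is in the loop.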

#### References and manuals

Unless already mentioned in the text, the manuals and documentation we used are listed here. These resources made the last map possible:

* https://python-visualization.github.io/folium/quickstart.html
* https://github.com/wrobstory/vincent
* https://altair-viz.github.io/
* http://nbviewer.jupyter.org/gist/BibMartin/4b9784461d2fa0d89353
* http://jeffpaine.github.io/geojson-topojson/
* http://nbviewer.jupyter.org/github/python-visualization/folium/blob/master/examples/TimeSliderChoropleth.ipynb
* http://nbviewer.jupyter.org/github/python-visualization/folium/blob/master/examples/Popups.ipynb
* http://nbviewer.jupyter.org/github/python-visualization/folium/tree/master/examples/
* http://bokeh.pydata.org/en/0.11.0/docs/user_guide/charts.html