# 03 - Interactive Viz

## Additional resources

This file provides some links to some interesting data visualization projects across the web. This list is not intented to be exhaustive, just to offer a reference for inspiration or information destined to the curious.



* [A map which takes its color scheme from images](https://www.mapbox.com/cartogram/)

* [Interesting map visualizations](http://www.viewsoftheworld.net/)

* [Dataviz Blog](https://bl.ocks.org/mbostock) by Mike Bostock, the creator of [`d3.js`](https://d3js.org/), a popular visualization tool for the web 

* [Collection of interesting visualizations](https://flowingdata.com/category/visualization/)

* [Gene explorer, for biologists](http://www.bar.utoronto.ca/GeneSlider/?datasource=CNSData&chr=1&start=3120&end=5000)

* [Exploring 100k stars (Chrome only)](https://stars.chromeexperiments.com/)

* [Interactive map of world trade over time](http://www.visualcapitalist.com/interactive-mapping-flow-international-trade/)

* [Visualizing deaths in conflicts across the world](http://www.informationisbeautiful.net/visualizations/senseless-conflict-deaths-per-hour/)

* [Where did immigrants to the US come from over time ?](http://metrocosm.com/animated-immigration-map/)

* [Listening to Wikipedia](http://listen.hatnote.com/)

* [A live map of Twitter](https://www.mapd.com/demos/tweetmap/)

* [Collection of cool visualizations](http://www.informationisbeautiful.net/)

* [A Choropleth map of Switzerland with mountains in relief](https://timogrossenbacher.ch/2016/12/beautiful-thematic-maps-with-ggplot2-only/)

* [Interactive datavizualisations of the UK](https://mappl.uk/)

* [Most used word in each state of the US (xkcd)](https://imgs.xkcd.com/comics/state_word_map.png)

* [Drone deaths in Pakistan]( http://drones.pitchinteractive.com/)

* [Map projection transitions](https://www.jasondavies.com/maps/transition/)

* [The five main projects of the Belt and Road project in China](http://multimedia.scmp.com/news/china/article/One-Belt-One-Road/index.html)

* [Full images of the Earth datastory](https://pudding.cool/2017/10/satellites/)

* [Surprise! Showing the unexpected](https://medium.com/@uwdata/surprise-maps-showing-the-unexpected-e92b67398865)

# 03 - Interactive Viz

## Deadline

Wednesday November 8th, 2017 at 11:59PM

## Important Notes

- Make sure you push on GitHub your Notebook with all the cells already evaluated
- Note that maps do not render in a standard Github environment : you should export them to HTML and link them in your notebook.
- Remember that `.csv` is not the only data format. Though they might require additional processing, some formats provide better encoding support.
- Don't forget to add a textual description of your thought process, the assumptions you made, and the solution you plan to implement!
- Please write all your comments in English, and use meaningful variable names in your code

## Background

In this homework we will be exploring interactive visualization, which is a key ingredient of many successful data visualizations (especially when it comes to infographics).

Unemployment rates are major economic metrics and a matter of concern for governments around the world. Though its definition may seem straightforward at first glance (usually defined as the number of unemployed people divided by the active population), it can be tricky to define consistently. For example, one must define what exactly unemployed means : looking for a job ? Having declared their unemployment ? Currently without a job ? Should students or recent graduates be included ? We could also wonder what the active population is : everyone in an age category (e.g. `16-64`) ? Anyone interested by finding a job ? Though these questions may seem subtle, they can have a large impact on the interpretation of the results : `3%` unemployment doesn't mean much if we don't know who is included in this percentage. 

In this homework you will be dealing with two different datasets from the statistics offices of the European commission ([eurostat](http://ec.europa.eu/eurostat/data/database)) and the Swiss Confederation ([amstat](https://www.amstat.ch)). They provide a variety of datasets with plenty of information on many different statistics and demographics at their respective scales. Unfortunately, as is often the case is data analysis, these websites are not always straightforward to navigate. They may include a lot of obscure categories, not always be translated into your native language, have strange link structures, … Navigating this complexity is part of a data scientists' job : you will have to use a few tricks to get the right data for this homework.

For the visualization part, install [Folium](https://github.com/python-visualization/folium) (*HINT*: it is not available in your standard Anaconda environment, therefore search on the Web how to install it easily!). Folium's `README` comes with very clear examples, and links to their own iPython Notebooks -- make good use of this information. For your own convenience, in this same directory you can already find two `.topojson` files, containing the geo-coordinates of 

- European countries (*liberal definition of EU*) (`topojson/europe.topojson.json`, [source](https://github.com/leakyMirror/map-of-europe))
- Swiss cantons (`topojson/ch-cantons.topojson.json`) 

These will be used as an overlay on the Folium maps.

In [1]:
import pandas as pd
import numpy as np
import branca.colormap as cm
import folium
import json
import geopandas as gpd
from geopandas import GeoSeries, GeoDataFrame
from folium.plugins import MarkerCluster

#### Paths

In [2]:
europe_csv = 'data/lfsq_urgan_1_Data.csv'
topo_path = r'topojson/europe.topojson.json'

In [3]:
topo_data = json.load(open(topo_path))

# 1) European unemployment

TODO: compare rate in Switzerland to the rest of Europe

Go to the [eurostat](http://ec.europa.eu/eurostat/data/database) website and try to find a dataset that includes the european unemployment rates at a recent date.

   Use this data to build a [Choropleth map](https://en.wikipedia.org/wiki/Choropleth_map) which shows the unemployment rate in Europe at a country level. Think about [the colors you use](https://carto.com/academy/courses/intermediate-design/choose-colors-1/), how you decided to [split the intervals into data classes](http://gisgeography.com/choropleth-maps-data-classification/) or which interactions you could add in order to make the visualization intuitive and expressive. Compare Switzerland's unemployment rate to that of the rest of Europe.

#### Extracting unemployment rates

We use the unemployement rates of the second quarter of 2017. These are the most recent rates found on the eurostat website.

In [4]:
df = pd.read_csv(europe_csv)
df = df.loc[df.TIME == '2017Q2'].loc[df.SEX == 'Total'].loc[:, ['GEO', 'Value']]
df.columns = ['country', 'rate']
df = df[6:].reset_index(drop=True)
df.head()

Unnamed: 0,country,rate
0,Belgium,7.0
1,Bulgaria,6.3
2,Czech Republic,3.0
3,Denmark,5.5
4,Germany (until 1990 former territory of the FRG),3.8


Changing some names.

In [5]:
df.loc[df.country == 'Germany (until 1990 former territory of the FRG)', 'country'] = 'Germany'
df.loc[df.country == 'Former Yugoslav Republic of Macedonia, the', 'country'] = 'The former Yugoslav Republic of Macedonia'

#### Retrieving the two-letter codes for each country. 

These codes are contained in the topojson file provided. They will be useful in order to build the map.

In [6]:
topo_data['objects']['europe']['geometries']

[{'arcs': [[[0, 1, 2]], [[3]], [[4]], [[5, 6, 7, 8, 9, 10], [11]]],
  'id': 'AZ',
  'properties': {'NAME': 'Azerbaijan'},
  'type': 'MultiPolygon'},
 {'arcs': [[12, 13, 14, 15, 16, 17, 18]],
  'id': 'AL',
  'properties': {'NAME': 'Albania'},
  'type': 'Polygon'},
 {'arcs': [[[-12]], [[19, -3, 20, 21, -7], [-5], [-4]]],
  'id': 'AM',
  'properties': {'NAME': 'Armenia'},
  'type': 'MultiPolygon'},
 {'arcs': [[22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
  'id': 'BA',
  'properties': {'NAME': 'Bosnia and Herzegovina'},
  'type': 'Polygon'},
 {'arcs': [[32, 33, 34, 35, 36, 37]],
  'id': 'BG',
  'properties': {'NAME': 'Bulgaria'},
  'type': 'Polygon'},
 {'arcs': [[38]],
  'id': 'CY',
  'properties': {'NAME': 'Cyprus'},
  'type': 'Polygon'},
 {'arcs': [[[39]],
   [[40]],
   [[41]],
   [[42]],
   [[43]],
   [[44]],
   [[45]],
   [[46]],
   [[47]],
   [[48]],
   [[49]],
   [[50]],
   [[51]],
   [[52]],
   [[53]],
   [[54, 55]],
   [[56]],
   [[57]]],
  'id': 'DK',
  'properties': {'NAME': 'Denmar

In [7]:
countries = {}
for d in topo_data['objects']['europe']['geometries']:
    countries[d['properties']['NAME']] = d['id']

In [8]:
countries

{'Albania': 'AL',
 'Andorra': 'AD',
 'Armenia': 'AM',
 'Austria': 'AT',
 'Azerbaijan': 'AZ',
 'Belarus': 'BY',
 'Belgium': 'BE',
 'Bosnia and Herzegovina': 'BA',
 'Bulgaria': 'BG',
 'Croatia': 'HR',
 'Cyprus': 'CY',
 'Czech Republic': 'CZ',
 'Denmark': 'DK',
 'Estonia': 'EE',
 'Faroe Islands': 'FO',
 'Finland': 'FI',
 'France': 'FR',
 'Georgia': 'GE',
 'Germany': 'DE',
 'Greece': 'GR',
 'Holy See (Vatican City)': 'VA',
 'Hungary': 'HU',
 'Iceland': 'IS',
 'Ireland': 'IE',
 'Israel': 'IL',
 'Italy': 'IT',
 'Latvia': 'LV',
 'Liechtenstein': 'LI',
 'Lithuania': 'LT',
 'Luxembourg': 'LU',
 'Malta': 'MT',
 'Monaco': 'MC',
 'Montenegro': 'ME',
 'Netherlands': 'NL',
 'Norway': 'NO',
 'Poland': 'PL',
 'Portugal': 'PT',
 'Republic of Moldova': 'MD',
 'Romania': 'RO',
 'Russia': 'RU',
 'San Marino': 'SM',
 'Serbia': 'RS',
 'Slovakia': 'SK',
 'Slovenia': 'SI',
 'Spain': 'ES',
 'Sweden': 'SE',
 'Switzerland': 'CH',
 'The former Yugoslav Republic of Macedonia': 'MK',
 'Turkey': 'TR',
 'Ukraine': 'U

#### Enterring those codes in the dataframe

In [9]:
for i in df.index:
    df.loc[i, 'code'] = countries[df.loc[i, 'country']]

In [10]:
df

Unnamed: 0,country,rate,code
0,Belgium,7.0,BE
1,Bulgaria,6.3,BG
2,Czech Republic,3.0,CZ
3,Denmark,5.5,DK
4,Germany,3.8,DE
5,Estonia,7.0,EE
6,Ireland,6.4,IE
7,Greece,21.2,GR
8,Spain,17.2,ES
9,France,9.1,FR


In [11]:
df = df.sort_values('rate').reset_index(drop=True)
df.head()

Unnamed: 0,country,rate,code
0,Czech Republic,3.0,CZ
1,Iceland,3.4,IS
2,Germany,3.8,DE
3,Malta,4.1,MT
4,United Kingdom,4.3,GB


#### Choosing intervals for colors :

We use 5 classes:
* 1 for outliers (Spaine, Greece and Macedonia)
* 4 for the other countries using quantile classification 

In [12]:
df.loc[:len(df)-3, 'class'] = pd.qcut(df.rate[:len(df)-3], 4, labels = [i for i in range(1,5)])
df.loc[len(df)-3:, 'class'] = 5

Here is the number of countries for which the unemployment rate is in each interval.

In [13]:
df['class'].value_counts()

4.0    8
1.0    8
3.0    7
2.0    7
5.0    3
Name: class, dtype: int64

And here are the bounds for the intervals

In [14]:
t = [min(df['rate'])]
for i in range(1,6):
    tmp = df.loc[df['class'] == i]['rate'].values
    t.append(max(tmp))
    
t

[3.0,
 4.4000000000000004,
 6.2999999999999998,
 8.0999999999999996,
 11.0,
 22.600000000000001]

#### Building the map

In [15]:
m = folium.Map(location=[50, 10], zoom_start=4, tiles='cartodbpositron')

#### Data missing from the data set.

We color the countries for which there is no data in black. As they will be put to 0 (yellow later), they will appear in green on the final map.

In [16]:
present = set(df['country'].values)

In [17]:
def missing_countries(name):
    if name in present:
        return '#ffffff'
    else:
        return '#000000'

In [18]:
folium.TopoJson(topo_data, object_path='objects.europe', style_function=lambda feature: {
    'fillColor': missing_countries(feature['properties']['NAME']),
    'color' : 'black',
    'opacity' : 0,
    'weight' : 2,
    'dashArray' : '5, 5'
        }).add_to(m)

<folium.features.TopoJson at 0x118976198>

#### Chloropleth

In [19]:
topo_data["objects"]["europe"]

{'geometries': [{'arcs': [[[0, 1, 2]],
    [[3]],
    [[4]],
    [[5, 6, 7, 8, 9, 10], [11]]],
   'id': 'AZ',
   'properties': {'NAME': 'Azerbaijan'},
   'type': 'MultiPolygon'},
  {'arcs': [[12, 13, 14, 15, 16, 17, 18]],
   'id': 'AL',
   'properties': {'NAME': 'Albania'},
   'type': 'Polygon'},
  {'arcs': [[[-12]], [[19, -3, 20, 21, -7], [-5], [-4]]],
   'id': 'AM',
   'properties': {'NAME': 'Armenia'},
   'type': 'MultiPolygon'},
  {'arcs': [[22, 23, 24, 25, 26, 27, 28, 29, 30, 31]],
   'id': 'BA',
   'properties': {'NAME': 'Bosnia and Herzegovina'},
   'type': 'Polygon'},
  {'arcs': [[32, 33, 34, 35, 36, 37]],
   'id': 'BG',
   'properties': {'NAME': 'Bulgaria'},
   'type': 'Polygon'},
  {'arcs': [[38]],
   'id': 'CY',
   'properties': {'NAME': 'Cyprus'},
   'type': 'Polygon'},
  {'arcs': [[[39]],
    [[40]],
    [[41]],
    [[42]],
    [[43]],
    [[44]],
    [[45]],
    [[46]],
    [[47]],
    [[48]],
    [[49]],
    [[50]],
    [[51]],
    [[52]],
    [[53]],
    [[54, 55]],
   

In [20]:
m.choropleth(geo_data=topo_data, data=df,
             columns=['code', 'rate'], topojson='objects.europe',
             key_on='feature.id',
             fill_color='YlOrRd', fill_opacity=0.7, line_opacity=0.2,
             
             legend_name='Unemployment rate (%)', threshold_scale=t, name='unemployment layer')

In [21]:
m

#### Bringing interactivity

We want to add some markers on each country in order to get the exact value of unemployement rate by clicking on it.
To do that we use a data set called `country_centroids_primary` that provides the coordinates of centroids of countries. This data set comes from [Gotos]('http://gothos.info/resources/').

In [22]:
centro = pd.read_csv('data/country_centroids_primary.csv', sep=('\t')).loc[:, ['LAT', 'LONG', 'SHORT_NAME']]
centro.columns = ['lat', 'long', 'name']
centro.head()

Unnamed: 0,lat,long,name
0,33.0,66.0,Afghanistan
1,41.0,20.0,Albania
2,28.0,3.0,Algeria
3,-14.333333,-170.0,American Samoa
4,42.5,1.5,Andorra


Now for each country of europe, we add the marker on the centroid. By clicking on the marker, one get the required value.

In [23]:
m_c = MarkerCluster().add_to(m)

for i in df.index:
    n = df.loc[i, 'country']
    p = df.loc[i, 'rate']
    if n == 'The former Yugoslav Republic of Macedonia':
        n = 'Macedonia'
    long, lat = centro.loc[centro.name == n, 'long'].values[0], centro.loc[centro.name == n, 'lat'].values[0]
    folium.Marker([lat, long], popup='{} : {}%'.format(n, p) , icon=folium.Icon(color='green')).add_to(m_c)

In [24]:
m

This is the final map. Green countries are countries for which there is no data.

# 2) Suisse

Go to the [amstat](https://www.amstat.ch) website to find a dataset that includes the unemployment rates in Switzerland at a recent date.

   > *HINT* Go to the `details` tab to find the raw data you need. If you do not speak French, German or Italian, think of using free translation services to navigate your way through. 

   Use this data to build another Choropleth map, this time showing the unemployment rate at the level of swiss cantons. Again, try to make the map as expressive as possible, and comment on the trends you observe.

   The Swiss Confederation defines the rates you have just plotted as the number of people looking for a job divided by the size of the active population (scaled by 100). This is surely a valid choice, but as we discussed one could argue for a different categorization.

   Copy the map you have just created, but this time don't count in your statistics people who already have a job and are looking for a new one. How do your observations change ? You can repeat this with different choices of categories to see how selecting different metrics can lead to different interpretations of the same data.

# 3) Swiss and foreign workers

Use the [amstat](https://www.amstat.ch) website again to find a dataset that includes the unemployment rates in Switzerland at recent date, this time making a distinction between *Swiss* and *foreign* workers.

   The Economic Secretary (SECO) releases [a monthly report](https://www.seco.admin.ch/seco/fr/home/Arbeit/Arbeitslosenversicherung/arbeitslosenzahlen.html) on the state of the employment market. In the latest report (September 2017), it is noted that there is a discrepancy between the unemployment rates for *foreign* (`5.1%`) and *Swiss* (`2.2%`) workers. 

   Show the difference in unemployment rates between the two categories in each canton on a Choropleth map (*hint* The easy way is to show two separate maps, but can you think of something better?). Where are the differences most visible ? Why do you think that is ?

   Now let's refine the analysis by adding the differences between age groups. As you may have guessed it is nearly impossible to plot so many variables on a map. Make a bar plot, which is a better suited visualization tool for this type of multivariate data.

#### Assumptions
- We download the website from the category "2.1 Arbeitslosenquote (worklessness rates)".
- We select data for September 2017, the month in which the latest SECO employment market report was released, to try to reproduce and validate their numbers as closely as possible.
- First, we have to do some cleanup, which is done below. The final dataframe will have a column per canton, two columns for worklessness rates in percentage per nationality (Swiss / foreigner), and two columns for the absolute number of registered worklessness people (Swiss / foreigners).
- TODO this metric (worklessness people) could be called into question -> compare to youth worklessness etc.

In [25]:
worklessness_data = 'data/worklessness_ch.xlsx'

First, we import and preprocess the downloaded data.

In [26]:
df = pd.read_excel(worklessness_data)
df.tail()

Unnamed: 0,Kanton,Nationalität,Monat,September 2017,September 2017.1,September 2017.2,Gesamt,Gesamt.1,Gesamt.2
49,Genf,Ausländer,,5.7,A,5942,5.7,A,5942
50,Genf,Schweizer,,4.8,A,6292,4.8,A,6292
51,Jura,Ausländer,,9.0,C,505,9.0,C,505
52,Jura,Schweizer,,3.6,B,1114,3.6,B,1114
53,Gesamt,,,3.0,A,133169,3.0,A,133169


There are some data on canton, nationality, month (`NaN`), worklessness quote and number as well as a category, which we will not consider. First, we rearrange the data in a new dataframe, saving worklessness quotes and numbers for foreigners and Swiss people in separate columns.

In [27]:
df = pd.read_excel(worklessness_data)
df.columns = (["canton","nationality","month","unemployment_rate","d1","workless_registered","d2","d3","d4"])
df = df.drop(df.index[0]) # delete first row
df = df.drop(["month","d1","d2","d3","d4"],axis=1,)
df.workless_registered = df.workless_registered.astype(int)
df.unemployment_rate = df.unemployment_rate.astype(float)

# new columns

unemp_ch = np.asarray(df[df["nationality"]=="Schweizer"].unemployment_rate)
unemp_for = np.asarray(df[df["nationality"]=="Ausländer"].unemployment_rate)
diff_unemp_ch_for = unemp_for - unemp_ch
unemp_ch_reg = np.asarray(df[df["nationality"]=="Schweizer"].workless_registered)
unemp_for_reg = np.asarray(df[df["nationality"]=="Ausländer"].workless_registered)
cantons = list(np.unique(df[df.canton != "Gesamt"].canton))

# put data in new, clean dataframe
df = pd.DataFrame({"canton":cantons,
                       "unemp_ch": unemp_ch,
                       "unemp_for": unemp_for,
                       "unemp_ch_reg": unemp_ch_reg,
                       "unemp_for_reg": unemp_for_reg,
                  "diff_unemp_ch_for": diff_unemp_ch_for})

df.head()

Unnamed: 0,canton,diff_unemp_ch_for,unemp_ch,unemp_ch_reg,unemp_for,unemp_for_reg
0,Aargau,2.8,2.5,15114,5.3,12111
1,Appenzell Ausserrhoden,3.7,1.8,8758,5.5,4900
2,Appenzell Innerrhoden,2.6,1.3,2292,3.9,1593
3,Basel-Landschaft,1.7,0.4,59,2.1,53
4,Basel-Stadt,2.2,1.2,838,3.4,617


The data above shows the relevant data extracted for September 2017: the name of the canton, the umemployment rate for Swiss people (`umemp_ch_reg`), foreigners (`unemp_for_reg`), as well as total number of registered workless people (`unemp_ch_reg` and `unemp_for_reg`). The difference between the foreign unemployed and Swiss workless quote per canton was derived and saved in `diff_unemp_ch_for`.

#### Unemployment rate by nationality

Since the total number of unemployed people is not divided in initial dataset (as seen above in `df.tail()`), we have to either avarage per month or try to derive it differently.

TODO: try with other downloads

In [28]:
# we can now delete the column for the entire Swiss territory as we'll only consider the cantons hereafter
df = df.drop(df.index[len(df)-1])

df[["unemp_ch","unemp_for"]].mean().round(2)

unemp_ch     1.89
unemp_for    4.28
dtype: float64

To find out the total number of unemployed people, we can ponder the percentage by the number of unemployed people (which gives us information on the total number of people living in the canton). We calculate the unemployment quote for Switzerland by calculating first the total population:
$pop_{canton,CH}=\frac{pop_{unemployed,canton,CH}}{p_{unemployed,canton,CH}}\times 100$

$pop_{canton,foreign}=\frac{pop_{unemployed,canton,foreign}}{p_{unemployed,canton,foreign}}\times 100$

Then, we calculate a new, total worklessness percentage:
$p_{unemployed, CH} = \frac{1}{\sum pop_{canton,CH}}\sum_{i}pop^{(i)}_{unemployed,canton,CH}\times pop^{(i)}_{canton,CH}$

In [29]:
pop_tot_for = np.array(np.round(df["unemp_for_reg"]/df["unemp_for"]*100))
pop_tot_for = pop_tot_for.astype(int)
pop_tot_ch = np.array(np.round(df["unemp_ch_reg"]/df["unemp_ch"]*100))
pop_tot_ch = pop_tot_ch.astype(int)
df["pop_tot_for"]=pop_tot_for
df["pop_tot_ch"]=pop_tot_ch

# get weighted worklessness rates
mean_unemp_ch = 1/np.sum(df["pop_tot_ch"])*np.sum(df["unemp_ch"]*df["pop_tot_ch"])
mean_unemp_for = 1/np.sum(df["pop_tot_for"])*np.sum(df["unemp_for"]*df["pop_tot_for"])
print("unemployment rate for foreigners: " + str(np.round(mean_unemp_for,1))+ "%")
print("unemployment rate for Swiss people: " + str(np.round(mean_unemp_ch,1))+ "%")

unemployment rate for foreigners: 5.1%
unemployment rate for Swiss people: 2.2%


**Note**: We now get exactly the numbers that were mentioned in the report (`5.1%` for foreigners and `2.2%` for Swiss people)!

## Show Map of Switzerland
We first show a chloropleth map of Swiss and foreign worklessness rates for each canton. First, we show separate maps for foreigner and Swiss worklessness rates, then, we show the difference in worklessness rates between foreigners and Swiss.

In [30]:
topo_path = r'topojson/ch-cantons.topojson.json'
topo_data = json.load(open(topo_path))

We add the canton codes (such a BE for Bern) to the dataframe for later fusion with other datasources (latitude and longitude of centroids) and ease of use in mapping .

In [31]:
cantons = {}
for canton in topo_data['objects']['cantons']['geometries']:
    cantons[canton['properties']['name']]  = canton['id']
to_remove = ["Bern/Berne","Graubünden/Grigioni","Fribourg","Ticino","Vaud","Valais/Wallis","Neuchâtel","Genève"]
to_replace = ["Bern", "Graubünden","Freiburg","Tessin","Waadt","Wallis","Neuenburg","Genf"]
for newkey,oldkey in zip(to_replace, to_remove):
    cantons[newkey] = cantons.pop(oldkey)
for i in df.index:
    df.loc[i, 'code'] = cantons[df.loc[i, 'canton']]

The code field is now added:

In [32]:
df.head()

Unnamed: 0,canton,diff_unemp_ch_for,unemp_ch,unemp_ch_reg,unemp_for,unemp_for_reg,pop_tot_for,pop_tot_ch,code
0,Aargau,2.8,2.5,15114,5.3,12111,228509,604560,AG
1,Appenzell Ausserrhoden,3.7,1.8,8758,5.5,4900,89091,486556,AR
2,Appenzell Innerrhoden,2.6,1.3,2292,3.9,1593,40846,176308,AI
3,Basel-Landschaft,1.7,0.4,59,2.1,53,2524,14750,BL
4,Basel-Stadt,2.2,1.2,838,3.4,617,18147,69833,BS


#### Obtain centroids for cantons
- Municipality data are obtained on the [OpenData](https://opendata.swiss/en/dataset/gemeindetypologie-are) platform, which provides a link to official data of the Confederation on commune boarders. They contain a field KT_KZ which is the code of the canton (Kanton-Kennzeichen). We need to dissolve this shapefile to obtain geometries for each canton.
- [Stackoverflow thread for dissolving geometries](https://gis.stackexchange.com/questions/149959/dissolving-polygons-based-on-attributes-with-python-shapely-fiona)
- Finally, we obtain the centroids by using the geopandas library, which allows getting the centroid of Shapefile geometries.

In [33]:
def dissolve_geometries():
    # define your directories and file names
    
    name_in = 'data/ARE_GemTyp00_9.shp'
    name_out = 'data/cantons_dissolved.shp'

    # create a dictionary
    states = {}
    # open your file with geopandas
    counties = GeoDataFrame.from_file(name_in, crs = {'init' :'epsg:21781'})
    for i in range(len(counties)):
        state_id = counties.at[i, 'KT_KZ']
        county_geometry = counties.at[i, 'geometry']
        # if the feature's state doesn't yet exist, create it and assign a list
        if state_id not in states:
            states[state_id] = []
        # append the feature to the list of features
        states[state_id].append(county_geometry)

    # create a geopandas geodataframe, with columns for state and geometry
    states_dissolved = GeoDataFrame(columns=['state', 'geometry'], crs=counties.crs)
    # iterate your dictionary
    for state, county_list in states.items():
        # create a geoseries from the list of features
        geometry = GeoSeries(county_list)
        # use unary_union to join them, thus returning polygon or multi-polygon
        geometry = geometry.unary_union
        # set your state and geometry values
        states_dissolved.set_value(state, 'state', state)
        states_dissolved.set_value(state, 'geometry', geometry)

    # save to file
    states_dissolved.to_file(name_out, driver="ESRI Shapefile")
    return states_dissolved
    
states_dissolved = dissolve_geometries() # takes some time to execute

In [34]:
states = np.array(states_dissolved['state'])

In [35]:
gdf = states_dissolved.centroid.to_crs({'init': 'epsg:4326'})
x = np.asarray(gdf.centroid.map(lambda p: p.x))
y = np.asarray(gdf.centroid.map(lambda p: p.y))
centroids = []
for i in range(len(states)):
    centroids.append([x[i],y[i],states[i]])
centroids = pd.DataFrame(centroids,columns = ["lon","lat","code"])

We now merge the coordinate dataframe with the worklessness dataframe to add the latitude and longitude to it.

In [36]:
df = pd.merge(centroids, df)


Now that the centroid coordinates have been added, we can show a chloropleth map and a label at the centroids of each canton with the unemployment rates (unfortunately, Folium, doesn't offer too many visualization options, so printing labels on the areas is tedious and not very elegant...)


#### Swiss worklessness rates by canton¶

In [37]:
center_ch = [46.801111, 8.226667]

# put map code in a def() for later use
def makemap(field='unemp_ch', t_min = None, t_max = None):
    if t_min is None and t_max is None:
        t_min = np.min([df.unemp_ch.values.min(),df.unemp_for.values.min()])
        t_max = np.max([df.unemp_ch.values.max(),df.unemp_for.values.max()])
    t = list(np.linspace(t_min,t_max,6))
        
    m = folium.Map(location=center_ch, zoom_start=8, tiles='cartodbpositron')
    m.choropleth(geo_data=topo_data, data=df,
                 columns=['code', field], topojson='objects.cantons',
                 key_on='feature.id',
                 fill_color='YlOrRd', fill_opacity=0.7, line_opacity=0.2,
                 legend_name='Unemployment rate (%)', threshold_scale = t, name='unemployment layer',highlight=True)

    # add markers
    m_c = MarkerCluster().add_to(m)
    for i in df.index:
        n = df.loc[i, 'code']
        p = df.loc[i, field]
        long, lat = df.loc[df.code == n, 'lon'].values[0], df.loc[df.code == n, 'lat'].values[0]
        folium.Marker([lat, long], popup='{} : {}%'.format(n, p) , icon=folium.Icon(color='green')).add_to(m_c)
    return m

m = makemap(field = 'unemp_ch')
m

#### Foreigner worklessness rates by canton

In [38]:
m = makemap(field = 'unemp_for')
m


The colors above (same scale) visually make it clear that worklessness is higher for foreigners than for Swiss people.

#### Difference between foreigner and Swiss worklessness rates
Finally, we can show the difference in rates between foreigners and Swiss people. A positive rate indicates higher foreigner worklessness rates, a negative value a higher Swiss worklessness rate.

In [39]:
m = makemap(field = 'diff_unemp_ch_for', t_min = df.diff_unemp_ch_for.min(), t_max = df.diff_unemp_ch_for.max())
m

# 4) Bonus: 

*BONUS*: using the map you have just built, and the geographical information contained in it, could you give a *rough estimate* of the difference in unemployment rates between the areas divided by the [Röstigraben](https://en.wikipedia.org/wiki/R%C3%B6stigraben)?

We can already see from the visualizations above that the French-speaking part of Switzerland has higher worklessness rates. To confirm this, we will calculate the mean worklessness rates in the cantons correponding to the language regions of Switzerland, similar as done before for foreigners / Swiss people. We will suppose that bilingual cantons belong to the region of the majority language group.