# Assignment 2: Voting Visualized

## Deadline

Oct. 24th

## Important notes

- Make sure you push your notebook to GitHub with all the cells already evaluated.
- Note that maps do not render in a standard GitHub environment. You should export them to HTML and link them in your notebook.
- Don't forget to add a textual description of your thought process, the assumptions you made, and the solution you implemented.
- Please write all your comments in English, and use meaningful variable names in your code.
- Your repo should have a single notebook (plus the data files necessary) in the master branch. If there are multiple notebooks present, we will not grade anything. 

## Background


* Are you curious to know what the political leanings of the people of Switzerland are?
* Do you wake up in a cold sweat, wondering which party won the last cantonal parliament election in Vaud?
* Are you looking to learn all sorts of visualizations, including maps, in Python?

If your answer to any of the above is yes, this assignment is just right for you. Otherwise, it's still an assignment, so we're terribly sorry.

The chief aim of this assignment is to familiarize you with visualizations in Python, particularly maps, and also to give you some insight into how visualizations are to be interpreted. The data we will use is the data on Swiss cantonal parliament elections from 2007 to 2018, which contains, for each cantonal election in this time period, the voting percentages for each party and canton.

For the visualization part, install [Folium](https://github.com/python-visualization/folium) (_Hint: it is not available in your standard Anaconda environment, so search the Web for how to install it easily!_). Folium's README comes with very clear examples and links to its own IPython notebooks, so make good use of this information. For your convenience, this directory already contains one TopoJSON file with the geo-coordinates of the cantonal borders of Switzerland.

One last, general reminder: back up any hypotheses and claims with data, since this is an important aspect of the course.

In [613]:
# Put your imports here.
import folium
import pandas as pd
import json
from branca.utilities import split_six


In [614]:
data_folder = './data/'

## Task 1: Cartography and census

__A)__ Display a Swiss map that has cantonal borders as well as the national borders. We provide a TopoJSON `data/ch-cantons.topojson.json` that contains the borders of the cantons.

__B)__ Take the spreadsheet `data/communes_pop.xls`, collected from [admin.ch](https://www.bfs.admin.ch/bfs/fr/home/statistiques/catalogues-banques-donnees/tableaux.assetdetail.5886191.html), containing population figures for every commune. You can use [pd.read_excel()](https://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.read_excel.html) to read the file and to select specific sheets. Plot a histogram of the population counts and explain your observations. Do not use a log-scale plot for now. What does this histogram tell you about urban and rural communes in Switzerland? Are there any clear outliers on either side, and if so, which communes?
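
One quick way to back up a claim about skew before plotting is to compare the mean and the median of the population counts: a mean far above the median suggests that a few very large communes dominate. The snippet below is only a sketch of that check. The helper name `skew_summary` and the population values are hypothetical and are not taken from `communes_pop.xls`.

```python
import statistics


def skew_summary(populations):
    """Return the mean, the median and their ratio for a list of counts."""
    mean = statistics.mean(populations)
    median = statistics.median(populations)
    return {"mean": mean, "median": median, "mean_over_median": mean / median}


# Hypothetical commune populations, NOT read from the spreadsheet.
sample_populations = [150, 420, 800, 1200, 2500, 4000, 9000, 400000]
summary = skew_summary(sample_populations)
print(summary)
```

A ratio well above 1 is a hint, not a proof, of a heavy right tail; the histogram itself should confirm it.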

__C)__ The figure below shows 4 types of histogram. At this stage, our distribution should look like Fig.(a). A common way to represent [power-laws](https://en.wikipedia.org/wiki/Power_law) is a histogram on a log-log scale. Remember: the x-axis of a histogram is segmented into bins of equal size, and the y-value of each bin is the number of data points that fall into it. As shown in Fig.(b), small bin sizes might introduce artifacts. Fig.(b) and Fig.(c) are examples of histograms with two different bin sizes. Another great way to visualize such a distribution is a cumulative representation, as shown in Fig.(d), in which the y-axis represents the number of data points with values greater than the corresponding x-value.  
  
Create the figures (b) and (d) using the data extracted for task 1B. For Fig.(b), represent two histograms using two different bin sizes and provide a brief description of the results. What does this tell you about the relationship between the two variables, namely the frequency of each bin and the value (i.e. population in case of the communal data) for each bin?
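
To experiment with different bin sizes on a log-log scale, one option is to build bin edges that are equally spaced in log10 and count the values per bin. This is a minimal sketch in plain Python. The helpers `log_bin_edges` and `histogram` and the sample values are hypothetical, and the parameter `bins_per_decade` is just one possible way to control the bin size.

```python
import math


def log_bin_edges(min_value, max_value, bins_per_decade):
    """Return bin edges that are equally spaced on a log10 axis."""
    low = math.floor(math.log10(min_value))
    high = math.ceil(math.log10(max_value))
    n_bins = (high - low) * bins_per_decade
    return [10 ** (low + i / bins_per_decade) for i in range(n_bins + 1)]


def histogram(values, edges):
    """Count values in half-open bins [edges[i], edges[i+1]); the last bin is closed."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for value in values:
        for i in range(len(counts)):
            in_bin = edges[i] <= value < edges[i + 1]
            if in_bin or (i == last and value == edges[-1]):
                counts[i] += 1
                break
    return counts


# Hypothetical commune populations, NOT read from the spreadsheet.
sample_populations = [150, 420, 800, 1200, 2500, 4000, 9000, 400000]

coarse_edges = log_bin_edges(min(sample_populations), max(sample_populations), 1)
coarse_counts = histogram(sample_populations, coarse_edges)

fine_edges = log_bin_edges(min(sample_populations), max(sample_populations), 2)
fine_counts = histogram(sample_populations, fine_edges)

print(coarse_counts)
print(fine_counts)
```

With finer bins, more bins end up empty or nearly empty, which is the kind of artifact the task asks you to describe.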

<img src="plaw_crop.png" style="width: 600px;">
  
The figure is extracted from [this paper](https://arxiv.org/pdf/cond-mat/0412004.pdf) that contains more information about this family of distributions.
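
The cumulative representation of Fig.(d) needs no binning at all. The sketch below computes, for each distinct value x, how many data points are greater than or equal to x. The choice of "greater than or equal" rather than strictly "greater than" is an assumption made here so that the largest value still gets a non-zero count on a log axis. The function name `ccdf` and the input values are hypothetical.

```python
def ccdf(values):
    """Return (x, count) pairs, where count is the number of values >= x."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered):
        # Only emit a point for the first occurrence of each distinct value.
        if i == 0 or x != ordered[i - 1]:
            points.append((x, n - i))
    return points


# Hypothetical values, NOT read from the spreadsheet.
example_points = ccdf([1, 2, 2, 5])
print(example_points)
```

Plotting these pairs with both axes on a log scale gives the cumulative curve; for a power law it should look roughly like a straight line.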

In [615]:
# Use a separate name for the file handle so it is not shadowed by the parsed data.
with open(data_folder + 'ch-cantons.topojson.json') as topojson_file:
    topojson = json.load(topojson_file)

## Task 2: Parties visualized

We provide a spreadsheet, `data/voters.xls`, (again) collected from [admin.ch](https://www.bfs.admin.ch/bfs/fr/home/statistiques/politique/elections/conseil-national/force-partis.assetdetail.217195.html), which contains the percentage of voters for each party and for each canton. For the following task, we will focus on the period 2014-2018 (the first page of the spreadsheet). Please report any assumptions you make regarding outliers, missing values, etc. Notice that data is missing for two cantons, namely Appenzell Innerrhoden and Graubünden; your visualizations should include data for every other canton.


__A)__ For the period 2014-2018 and for each canton, visualize, on the map, **the percentage of voters** in that canton who voted for the party [`UDC`](https://en.wikipedia.org/wiki/Swiss_People%27s_Party) (Union démocratique du centre). Does this party seem to be more popular in the German-speaking part, the French-speaking part, or the Italian-speaking part?

__B)__ For the same period, now visualize **the number of residents** in each canton who voted for UDC.

__C)__ Which one of the two visualizations above would be more informative in case of a national election with majority voting (i.e. when a party needs to have the largest number of citizens voting for it among all parties)? Which one is more informative for the cantonal parliament elections?

For part B, you can use the `data/national_council_elections.xlsx` file ([guess where we got it from](https://www.bfs.admin.ch/bfs/fr/home/statistiques/politique/elections/conseil-national/participation.assetdetail.81625.html)) to obtain the voting-eligible population of each canton in 2015.

________________________________________
Load voter data and take a peek:

In [616]:
voter_data = pd.read_excel(data_folder + "voters.xls")  # default sheet is the first one
voter_data.head()

Unnamed: 0,"Elections des parlements cantonaux, de 2014 à 2018: force des partis et attribution des listes mixtes* aux partis",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 54,Unnamed: 55,Unnamed: 56,Unnamed: 57,Unnamed: 58,Unnamed: 59,Unnamed: 60,Unnamed: 61,Unnamed: 62,T 17.02.05.02.03
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,Année électorale 2),Participation,PLR 6),,PDC 7),,PS,,...,JB,,Front,,Grut,,Autres 11),,K,Total
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


This does not make much sense.
We notice that there are some rows and columns with only NaN values. These are useless, so we remove them.

In [617]:
voter_data.dropna(how='all', inplace=True)
voter_data.dropna(axis=1, how='all', inplace=True)
voter_data


Unnamed: 0,"Elections des parlements cantonaux, de 2014 à 2018: force des partis et attribution des listes mixtes* aux partis",Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 6,Unnamed: 8,Unnamed: 10,Unnamed: 12,Unnamed: 14,Unnamed: 16,...,Unnamed: 46,Unnamed: 48,Unnamed: 50,Unnamed: 52,Unnamed: 54,Unnamed: 56,Unnamed: 58,Unnamed: 60,Unnamed: 62,T 17.02.05.02.03
2,,Année électorale 2),Participation,PLR 6),PDC 7),PS,UDC,Dém.,PLS 6),AdI,...,PSL,Lega,MCR,LS,JB,Front,Grut,Autres 11),K,Total
5,Zurich,2015,32.6525,17.3278,4.87871,19.7164,30.0232,,,,...,,,,,,,,0.669707,,100
6,Berne,2018,30.5163,11.7179,0.671415,22.3288,26.7609,,,,...,,,,,,,,0.912781,,100
7,Lucerne,2015,38.7413,21.0395,30.8625,11.8489,24.1156,,,,...,,,,,,,,0.0361293,,100
8,Uri 1),2016,61.9891,26.8567,31.2988,12.985,24.0532,,,,...,,,,,,,,2.42827,,100
9,Schwytz,2016,37.7471,21.629,27.1677,12.9254,33.1151,,,,...,,,,,,,,1.51358,,100
11,Obwald,2018,53.7933,17.1747,29.8036,15.0909,24.5323,,,,...,,,,,,,,13.3985,,100
12,Nidwald,2018,54.9216,28.0124,26.7501,4.47005,25.9166,,,,...,,,,,,,,1.44675,,100
13,Glaris,2018,29.4897,18.3928,9.40085,12.7504,25.2754,,,,...,,,,,,,,1.80364,,100
14,Zoug,2014,42.9394,22.1479,26.7831,9.25091,23.6318,,,,...,,,,,,,,1.57992,,100


Much better! Now we can see the headers and the values more clearly.
We note that the rows after row 34, which holds the canton Jura, seem to contain explanatory text about the dataset in the first column and NaNs in all other columns. If this is the case, they should all be removed.

We start by setting the first column of the dataframe as the index and the first row of the dataframe as the header:

In [618]:
voter_data.set_index(voter_data.columns[0], inplace=True)
voter_data.columns = voter_data.iloc[0]  # Set header
voter_data.drop(voter_data.index[[0]], inplace=True)  # Drop row that became header
voter_data.head()

nan,Année électorale 2),Participation,PLR 6),PDC 7),PS,UDC,Dém.,PLS 6),AdI,PEV,...,PSL,Lega,MCR,LS,JB,Front,Grut,Autres 11),K,Total
"Elections des parlements cantonaux, de 2014 à 2018: force des partis et attribution des listes mixtes* aux partis",Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Zurich,2015,32.6525,17.3278,4.87871,19.7164,30.0232,,,,4.27177,...,,,,,,,,0.669707,,100
Berne,2018,30.5163,11.7179,0.671415,22.3288,26.7609,,,,6.1729,...,,,,,,,,0.912781,,100
Lucerne,2015,38.7413,21.0395,30.8625,11.8489,24.1156,,,,0.199143,...,,,,,,,,0.0361293,,100
Uri 1),2016,61.9891,26.8567,31.2988,12.985,24.0532,,,,,...,,,,,,,,2.42827,,100
Schwytz,2016,37.7471,21.629,27.1677,12.9254,33.1151,,,,0.304428,...,,,,,,,,1.51358,,100


Now we can easily drop the rows and columns that, once again, do not contain any data:

In [619]:
voter_data.dropna(axis=1, how='all', inplace=True)
voter_data.dropna(how='all', inplace=True)
voter_data

nan,Année électorale 2),Participation,PLR 6),PDC 7),PS,UDC,PLS 6),PEV,PCS,PVL,...,PSA,PES,AVF 8),Sol.,DS,UDF,Lega,MCR,Autres 11),Total
"Elections des parlements cantonaux, de 2014 à 2018: force des partis et attribution des listes mixtes* aux partis",Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Zurich,2015,32.6525,17.3278,4.87871,19.7164,30.0232,,4.27177,,7.63786,...,,7.21878,2.97766,,,2.66228,,,0.669707,100
Berne,2018,30.5163,11.7179,0.671415,22.3288,26.7609,,6.1729,,6.91473,...,0.681873,10.1045,0.495841,,0.179432,3.71062,,,0.912781,100
Lucerne,2015,38.7413,21.0395,30.8625,11.8489,24.1156,,0.199143,,4.32021,...,,6.70001,,,,,,,0.0361293,100
Uri 1),2016,61.9891,26.8567,31.2988,12.985,24.0532,,,,,...,,2.37806,,,,,,,2.42827,100
Schwytz,2016,37.7471,21.629,27.1677,12.9254,33.1151,,0.304428,,2.54462,...,,0.800215,,,,,,,1.51358,100
Obwald,2018,53.7933,17.1747,29.8036,15.0909,24.5323,,,,,...,,,,,,,,,13.3985,100
Nidwald,2018,54.9216,28.0124,26.7501,4.47005,25.9166,,,,,...,,13.4041,,,,,,,1.44675,100
Glaris,2018,29.4897,18.3928,9.40085,12.7504,25.2754,,,,5.95738,...,,12.351,,,,,,,1.80364,100
Zoug,2014,42.9394,22.1479,26.7831,9.25091,23.6318,,,,4.97436,...,,11.632,,,,,,,1.57992,100
Fribourg,2016,39.3021,18.1655,23.707,23.5843,19.7176,,,3.64471,2.45073,...,,4.51465,,,,,,,3.97347,100


Manual inspection helps us identify and remove the two cantons without any data:

In [620]:
voter_data.drop(['Appenzell Rh. Int. 4) 5)', 'Grisons 5)'], inplace=True)
voter_data.shape

(24, 22)

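24 rows and 22 columns. The row count makes sense, since Switzerland has 26 cantons and we removed 2. Note that the 22 columns are not all political parties: they also include the election year, the participation rate, and the `Total` column.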
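
The index and column labels still carry footnote markers from the spreadsheet, such as `Uri 1)` or `PLR 6)`. If cleaner labels are wanted for plots or legends, a small helper like the one below could strip them. This is only a sketch: the name `strip_footnote_markers` and the regular expression are assumptions, and the pattern presumes that markers always appear at the end of a label as digits followed by a closing parenthesis.

```python
import re

# One or more trailing markers such as " 1)" or " 4) 5)".
FOOTNOTE_PATTERN = re.compile(r"(\s*\d+\))+\s*$")


def strip_footnote_markers(label):
    """Remove trailing footnote markers like '1)' from a spreadsheet label."""
    return FOOTNOTE_PATTERN.sub("", label).strip()


for raw_label in ["Uri 1)", "Appenzell Rh. Int. 4) 5)", "PLR 6)", "Zurich"]:
    print(repr(raw_label), "->", repr(strip_footnote_markers(raw_label)))
```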

Now that the data is cleaned we can start on task A:

__A)__ For the period 2014-2018 and for each canton, visualize, on the map, **the percentage of voters** in that canton who voted for the party [`UDC`](https://en.wikipedia.org/wiki/Swiss_People%27s_Party) (Union démocratique du centre). Does this party seem to be more popular in the German-speaking part, the French-speaking part, or the Italian-speaking part?

We want to create the map using folium and add the overlay with the cantons on top. To do this we need to use `data/ch-cantons.topojson.json` again. We need to make sure that the `voter_data` dataframe has an identifier that corresponds to the values in the `topojson` file. We therefore take a look into the file to see what we have:

In [621]:
print(json.dumps(topojson, indent=4))

{
    "type": "Topology",
    "transform": {
        "scale": [
            0.00045364536453645373,
            0.00019901990199019923
        ],
        "translate": [
            5.956,
            45.818
        ]
    },
    "objects": {
        "cantons": {
            "type": "GeometryCollection",
            "geometries": [
                {
                    "type": "Polygon",
                    "arcs": [
                        [
                            0,
                            1,
                            2,
                            3,
                            4,
                            5,
                            6,
                            7,
                            8,
                            9
                        ]
                    ],
                    "id": "ZH",
                    "properties": {
                        "name": "Z\u00fcrich"
                    }
                },
                {
                    "typ

Here we see that each canton has its own ID. Let's add these IDs to our `voter_data` dataframe.

In [622]:
cantons = topojson['objects']['cantons']['geometries']
canton_ids = [canton['id'] for canton in cantons]

# Skip the IDs of the two cantons without data.
# This relies on voter_data and the TopoJSON file listing the cantons in the same order.
voter_data['id'] = [canton_id for canton_id in canton_ids if canton_id not in ('AI', 'GR')]
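
Assigning IDs by position works only if both sources list the cantons in the same order and the counts line up. A cheap safeguard is to check the lengths before zipping names and IDs together. The sketch below uses the hypothetical helper `assign_ids_by_position` and a shortened, illustrative list of cantons; the position of `AI` in this short list is made up for the example.

```python
def assign_ids_by_position(names, ids, ids_without_data):
    """Pair canton names with IDs by position, after dropping IDs without data."""
    remaining_ids = [cid for cid in ids if cid not in ids_without_data]
    if len(remaining_ids) != len(names):
        raise ValueError(
            f"{len(names)} names but {len(remaining_ids)} IDs: "
            "positional assignment would be misaligned"
        )
    return dict(zip(names, remaining_ids))


# Shortened, illustrative lists (not the full set of cantons).
sample_names = ["Zurich", "Berne", "Lucerne", "Uri 1)", "Schwytz"]
sample_ids = ["ZH", "BE", "LU", "UR", "SZ", "AI"]

name_to_id = assign_ids_by_position(sample_names, sample_ids, {"AI", "GR"})
print(name_to_id)
```

A length check does not catch a different ordering, so it is still worth eyeballing a few name and ID pairs.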


We can now create a map focused on Switzerland and add a choropleth overlay onto it, binding the UDC voting percentage data with the topojson data.

In [623]:
m = folium.Map(
    location=[46.80048, 8.30635],
    tiles='Mapbox Bright',
    zoom_start=7.4
)

m.choropleth(
    geo_data=topojson,
    topojson='objects.cantons',
    name='Cantons',
    data=voter_data,
    columns=['id', 'UDC'],
    key_on="feature.id",
    fill_color='GnBu',
    fill_opacity=1,
    legend_name='Percentage of canton votes for UDC (%)',
    reset=True,
)

# Style function that blacks out the cantons for which we don't have any data.
# 'black' is a valid CSS color name, whereas '#grey' is not a valid hex color.
def style_function(feature):
    has_no_data = feature['id'] in ('AI', 'GR')
    return {'fillColor': 'black', 'fillOpacity': 1 if has_no_data else 0}

folium.TopoJson(
    topojson,
    'objects.cantons',
    name='topojson',
    style_function=style_function,
    control=False
).add_to(m)

folium.LayerControl().add_to(m)

m

The black areas on the map represent the two cantons for which we don't have any data.

From the map we can clearly see that the party is much more popular in the German-speaking part of Switzerland than in the French-speaking or Italian-speaking parts.

__B)__ For the same period, now visualize **the number of residents** in each canton who voted for UDC.

For this task, we need to use the population data which can be found in the `national_council_elections.xlsx` file.

In [624]:
num_res_data = pd.read_excel(data_folder + "national_council_elections.xlsx")
num_res_data

Unnamed: 0,Elections au Conseil national de 2015:,Unnamed: 1,Unnamed: 2,T 17.02.02.04.01
0,"électeurs inscrits, électeurs, participation a...",,,
1,,,,
2,,,,
3,,Electeurs inscrits,Electeurs 2),Participation en %
4,,,,
5,,,,
6,Total,5283556,2563052,48.51
7,,,,
8,Zurich,907623,428837,47.2484
9,Berne,729203,357770,49.0632


We see in the dataframe that we want rows 8-33. Furthermore, since we are only interested in the number of people eligible to vote, we only need the first two columns: the canton name and the registered voters (`Unnamed: 1`). Let's make these corrections:

In [625]:
# Copy the slice so that adding columns later does not trigger a SettingWithCopyWarning.
num_res_data = num_res_data.iloc[8:34, 0:2].copy()
num_res_data.head()


Unnamed: 0,Elections au Conseil national de 2015:,Unnamed: 1
8,Zurich,907623
9,Berne,729203
10,Lucerne,271143
11,Uri 1),26414
12,Schwytz,102145


We add the canton id column to the dataframe (since the order of the cantons is the same) and rename the population column:

In [626]:
num_res_data['id'] = canton_ids
num_res_data.rename(index=str, columns={"Unnamed: 1": "Voting_population"}, inplace=True)
num_res_data.head()

Unnamed: 0,Elections au Conseil national de 2015:,Voting_population,id
8,Zurich,907623,ZH
9,Berne,729203,BE
10,Lucerne,271143,LU
11,Uri 1),26414,UR
12,Schwytz,102145,SZ


Now we merge the two dataframes `num_res_data` and `voter_data` on the `id` column:

In [627]:
merged = pd.merge(num_res_data, voter_data, how='inner', on='id')
merged.head()

Unnamed: 0,Elections au Conseil national de 2015:,Voting_population,id,Année électorale 2),Participation,PLR 6),PDC 7),PS,UDC,PLS 6),...,PSA,PES,AVF 8),Sol.,DS,UDF,Lega,MCR,Autres 11),Total
0,Zurich,907623,ZH,2015,32.6525,17.3278,4.87871,19.7164,30.0232,,...,,7.21878,2.97766,,,2.66228,,,0.669707,100
1,Berne,729203,BE,2018,30.5163,11.7179,0.671415,22.3288,26.7609,,...,0.681873,10.1045,0.495841,,0.179432,3.71062,,,0.912781,100
2,Lucerne,271143,LU,2015,38.7413,21.0395,30.8625,11.8489,24.1156,,...,,6.70001,,,,,,,0.0361293,100
3,Uri 1),26414,UR,2016,61.9891,26.8567,31.2988,12.985,24.0532,,...,,2.37806,,,,,,,2.42827,100
4,Schwytz,102145,SZ,2016,37.7471,21.629,27.1677,12.9254,33.1151,,...,,0.800215,,,,,,,1.51358,100


In this new dataframe we create a new column containing the number of residents in each canton who voted for UDC.

This is done using the participation percentage column and the UDC percentage column along with the total population number column.
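
As a sanity check of this arithmetic, the estimate for a single canton can be worked out by hand: eligible voters times participation gives the number of people who voted, and multiplying by the UDC share gives the UDC votes. The sketch below uses the Zurich figures from the tables above; the function name `estimate_party_votes` is hypothetical. It also inherits the assumption made in this notebook that the 2015 national electorate and the cantonal participation rate can be combined.

```python
def estimate_party_votes(eligible_voters, participation_pct, party_share_pct):
    """Estimate votes as eligible voters x participation x party share (both in %)."""
    return eligible_voters * (participation_pct / 100) * (party_share_pct / 100)


# Zurich: 907623 registered voters, 32.6525 % participation, 30.0232 % for UDC.
zurich_udc_votes = estimate_party_votes(907623, 32.6525, 30.0232)
print(round(zurich_udc_votes))
```

For Zurich this gives roughly 89,000 votes.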

In [628]:
# Divide by 10000 since we multiply two percentages (UDC share and participation)
merged['udc_votes'] = merged.UDC * merged.Participation * merged.Voting_population / 10000
merged.udc_votes

Now we are ready to plot the map again in the same way as before, using our new `merged` dataframe!

Once again, the black cantons are those for which we don't have any data.

In [629]:
m = folium.Map(
    location=[46.80048, 8.30635],
    tiles='Mapbox Bright',
    zoom_start=7.4
)

m.choropleth(
    geo_data=topojson,
    topojson='objects.cantons',
    name='Cantons',
    data=merged,
    columns=['id', 'udc_votes'],
    key_on="feature.id",
    fill_color='GnBu',
    fill_opacity=1,
    legend_name='Number of residents in each canton who voted for UDC',
    reset=True,
)

# Style function that blacks out the cantons for which we don't have any data.
# 'black' is a valid CSS color name, whereas '#grey' is not a valid hex color.
def style_function(feature):
    has_no_data = feature['id'] in ('AI', 'GR')
    return {'fillColor': 'black', 'fillOpacity': 1 if has_no_data else 0}

folium.TopoJson(
    topojson,
    'objects.cantons',
    name='topojson',
    style_function=style_function,
    control=False
).add_to(m)

folium.LayerControl().add_to(m)

m

__C)__ Which one of the two visualizations above would be more informative in case of a national election with majority voting (i.e. when a party needs to have the largest number of citizens voting for it among all parties)? Which one is more informative for the cantonal parliament elections?


The visualization made in __A)__ is more informative for a cantonal parliament election, since it shows the vote share in each canton directly, making it easier to predict the outcome of each cantonal election. It is not very informative for a national election with majority voting, since a percentage alone does not tell us how many citizens it represents.

The visualization made in __B)__ is more informative for a national election with majority voting, provided that we also know the total number of people who voted. We can then compute the share of the national voting population that voted for UDC and draw conclusions about the likely election result. Without the number of people who voted in each canton, the visualization in __B)__ is not very informative for a cantonal election, because we cannot recover the vote share from an absolute count.
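
To make the national argument concrete, the absolute counts can be aggregated into a single national share: sum the estimated party votes over all cantons and divide by the summed number of people who voted. The following is a sketch under the same assumptions as part B; the function name `national_vote_share` is hypothetical, and only the Zurich and Berne rows from the tables above are used for illustration.

```python
def national_vote_share(cantons):
    """Return the party's share (in %) of all votes cast across the given cantons.

    Each entry is (eligible_voters, participation_pct, party_share_pct).
    """
    total_voters = 0.0
    total_party_votes = 0.0
    for eligible_voters, participation_pct, party_share_pct in cantons:
        voters = eligible_voters * participation_pct / 100
        total_voters += voters
        total_party_votes += voters * party_share_pct / 100
    return 100 * total_party_votes / total_voters


# Zurich and Berne only; a real analysis would include every canton with data.
two_canton_share = national_vote_share([
    (907623, 32.6525, 30.0232),
    (729203, 30.5163, 26.7609),
])
print(round(two_canton_share, 2))
```

The result is a turnout-weighted average of the cantonal shares, which is why large cantons dominate it.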

## Task 3: More socialism or more nationalism?

In this section, we focus on two parties that are representative of the left and the right on the Swiss political spectrum. You will propose a way to visualize their influence over time and for each canton.

__A)__ Take the two parties [`UDC`](https://en.wikipedia.org/wiki/Swiss_People%27s_Party) (Union démocratique du centre) and [`PS`](https://en.wikipedia.org/wiki/Social_Democratic_Party_of_Switzerland) (Parti socialiste suisse). For each canton, we define 'right lean' in a certain period as follows:

$$\frac{VoteShare_{UDC} - VoteShare_{PS}}{VoteShare_{UDC} + VoteShare_{PS}}$$  

Visualize the right lean of each canton on the map. What conclusions can you draw this time? Can you observe the [Röstigraben](https://en.wikipedia.org/wiki/R%C3%B6stigraben)?
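
The right lean defined above ranges from -1 (only PS votes) to +1 (only UDC votes), with 0 meaning both parties are equally strong. A minimal sketch of the computation follows. The function name `right_lean` is hypothetical, and returning `None` when both shares are zero is an assumption about how to handle an undefined ratio. The sample shares for Zurich and Fribourg come from the 2014-2018 table above.

```python
def right_lean(udc_share, ps_share):
    """Return (UDC - PS) / (UDC + PS), or None if both shares are zero."""
    total = udc_share + ps_share
    if total == 0:
        return None
    return (udc_share - ps_share) / total


zurich_lean = right_lean(30.0232, 19.7164)
fribourg_lean = right_lean(19.7176, 23.5843)
print(round(zurich_lean, 3), round(fribourg_lean, 3))
```

Because the value is centred on zero, a diverging color scale is a natural fit for the map.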

__B)__ For each party, devise a way to visualize the difference between its 2014-2018 vote share (i.e. percentage) and its 2010-2013 vote share for each canton. Propose a way to visualize this evolution of the party over time, and justify your choices. There's no single correct answer, but you must reasonably explain your choices.
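
One possible starting point is the change in percentage points per canton, shown with a color scale that is symmetric around zero so that gains and losses are equally visible. The sketch below is only one option among many. The helper names `vote_share_change` and `symmetric_limits`, the canton codes, and all numbers are hypothetical and do not come from the spreadsheets.

```python
def vote_share_change(current, previous):
    """Return the percentage-point change for cantons present in both periods."""
    return {
        canton: current[canton] - previous[canton]
        for canton in current
        if canton in previous
    }


def symmetric_limits(changes):
    """Return (-bound, bound), where bound is the largest absolute change."""
    bound = max(abs(change) for change in changes.values())
    return (-bound, bound)


# Hypothetical vote shares in % for made-up canton codes.
shares_2014_2018 = {"AA": 30.0, "BB": 20.0, "CC": 10.0}
shares_2010_2013 = {"AA": 28.0, "BB": 23.5}

changes = vote_share_change(shares_2014_2018, shares_2010_2013)
limits = symmetric_limits(changes)
print(changes, limits)
```

Cantons missing from either period are dropped here rather than treated as zero, which should be stated as an assumption if this approach is used.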