# "Namentliche Abstimmungen" in the Bundestag

> Parse and inspect "Namentliche Abstimmungen" (roll call votes) in the Bundestag (the federal German parliament)

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/eschmidt42/bundestag/HEAD)

## Context

The German parliament is kind enough to publish all votes of all members as readable XLSX / XLS files (and PDFs ¯\\\_(ツ)\_/¯ ). Those files can be found here: https://www.bundestag.de/parlament/plenum/abstimmung/liste.

Furthermore, the organisation [abgeordnetenwatch](https://www.abgeordnetenwatch.de/) offers a great platform to get to know the individual politicians and their behavior as well as an [open API](https://www.abgeordnetenwatch.de/api) to request data.

## Purpose of this repo

The purpose of this repo is to help collect roll call votes, either directly from the parliament's site or via abgeordnetenwatch's API, and make them available for analysis / modelling. This may be particularly interesting for the upcoming election in 2021. For example, if you want to see how your local member of parliament has voted in public roll call votes relative to the parties, or how closely individual parties agree in their votes, this dataset may be of interest to you.

Since the files on the bundestag website are stored in a way that makes them tricky to crawl automatically, a bit of manual work is required to generate the dataset. But don't fret! Quite a few recent roll call votes (as of the publishing of this repo) are already prepared for you. If older or more recent roll call votes are missing, the convenience tools demonstrated below reduce the manual effort. An alternative route to the same data and more (on politicians and local parliaments as well) is the abgeordnetenwatch API.

For your inspiration, I have also included an analysis of how similarly parties voted and how similar the votes of individual MdBs are to those of the parties, as well as a small machine learning model which predicts the individual votes of members of parliament. Teaser: the "Fraktionszwang" (party-line voting discipline) seems to exist but is not absolute, and the data shows it 😁.

## How to install

`pip install bundestag`

## How to use

For detailed explanations see:
- parse data from bundestag.de $\rightarrow$ `nbs/00_html_parsing.ipynb`
- parse data from abgeordnetenwatch $\rightarrow$ `nbs/03_abgeordnetenwatch.ipynb`
- analyze party / politician similarity $\rightarrow$ `nbs/01_similarities.ipynb`
- cluster polls $\rightarrow$ `nbs/04_poll_clustering.ipynb`
- predict politician votes $\rightarrow$ `nbs/05_predicting_votes.ipynb`

For a short overview of the highlights see below.

### Setup

In [1]:
%load_ext autoreload
%autoreload 2

In [None]:
from bundestag import html_parsing as hp
from bundestag import similarity as sim
from bundestag.gui import GUI
from bundestag import abgeordnetenwatch as aw
from bundestag import poll_clustering as pc
from bundestag import vote_prediction as vp

from pathlib import Path
import pandas as pd
from fastai.tabular.all import *

### Part 1 - Party/Party similarities and Politician/Party similarities using bundestag.de data

**Loading the data**

If you have cloned the repo, you should already have a `roll_call_votes.parquet` file in the root directory of the repo. If not, feel free to download the `roll_call_votes.parquet` file directly.

If you want to have a closer look at the preprocessing, please check out `nbs/00_html_parsing.ipynb`.

In [3]:
df = pd.read_parquet(path='roll_call_votes.parquet')
df.head(3).T

| | 0 | 1 | 2 |
|---|---|---|---|
| Wahlperiode | 17 | 17 | 17 |
| Sitzungnr | 198 | 198 | 198 |
| Abstimmnr | 1 | 1 | 1 |
| Fraktion/Gruppe | CDU/CSU | CDU/CSU | CDU/CSU |
| Name | Aigner | Altmaier | Aumer |
| Vorname | Ilse | Peter | Peter |
| Titel | | | |
| Bezeichnung | Ilse Aigner | Peter Altmaier | Peter Aumer |
| sheet_name | T_Export | T_Export | T_Export |
| date | 2012-10-18 00:00:00 | 2012-10-18 00:00:00 | 2012-10-18 00:00:00 |


#### Counting party votes

In [4]:
party_votes, _ = sim.get_votes_by_party(df)
party_votes.head()

| Fraktion/Gruppe | date | title | Enthaltung | ja | nein | nichtabgegeben | ungültig |
|---|---|---|---|---|---|---|---|
| AfD | 2017-12-12 | Bundeswehreinsatz gegen die Terrororganisation IS | 0.0 | 0.0 | 0.967391 | 0.032609 | 0.0 |
| AfD | 2017-12-12 | Bundeswehreinsatz im Irak | 0.0 | 0.0 | 0.978261 | 0.021739 | 0.0 |
| AfD | 2017-12-12 | Bundeswehreinsatz im Mittelmeer (SEA GUARDIAN) | 0.021739 | 0.913043 | 0.021739 | 0.043478 | 0.0 |
| AfD | 2017-12-12 | Bundeswehreinsatz in Afghanistan (Resolute Support) | 0.01087 | 0.0 | 0.956522 | 0.032609 | 0.0 |
| AfD | 2017-12-12 | Bundeswehreinsatz in Mali (MINUSMA) | 0.0 | 0.0 | 0.967391 | 0.032609 | 0.0 |
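
The values in the output above are fractions of a party's members per vote option and poll. As a rough illustration of that idea, here is a minimal pure-Python sketch. It is not the implementation of `sim.get_votes_by_party`; the toy records, the party names `A` / `B` and the helper `vote_shares` are hypothetical and only serve the example.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (party, poll title, vote). Not taken from the real data set.
records = [
    ("A", "Poll 1", "ja"),
    ("A", "Poll 1", "ja"),
    ("A", "Poll 1", "nein"),
    ("A", "Poll 1", "nichtabgegeben"),
    ("B", "Poll 1", "nein"),
    ("B", "Poll 1", "nein"),
]

# Vote options as they appear in the columns of the output above.
VOTE_OPTIONS = ["Enthaltung", "ja", "nein", "nichtabgegeben", "ungültig"]


def vote_shares(records):
    """Return {(party, poll): {vote option: fraction of the party's votes}}."""
    counts = defaultdict(Counter)
    for party, poll, vote in records:
        counts[(party, poll)][vote] += 1

    shares = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        # Options nobody chose get a share of 0.0 (Counter returns 0 for missing keys).
        shares[key] = {option: counter[option] / total for option in VOTE_OPTIONS}
    return shares


shares = vote_shares(records)
print(shares[("A", "Poll 1")])
```

Each row sums to 1, which is what makes the rows of different parties directly comparable.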


#### Visualizing similarities of `party` with all other parties over time

In [5]:
%%time
party = 'SPD'
similarity_party_party = (sim.align_party_with_all_parties(party_votes, party)
                          .pipe(sim.compute_similarity, lsuffix='a', rsuffix='b'))
similarity_party_party.head(3).T

CPU times: user 89.9 ms, sys: 0 ns, total: 89.9 ms
Wall time: 87.8 ms


| | 273 | 274 | 275 |
|---|---|---|---|
| date | 2017-12-12 00:00:00 | 2017-12-12 00:00:00 | 2017-12-12 00:00:00 |
| title | Bundeswehreinsatz gegen die Terrororganisation IS | Bundeswehreinsatz im Irak | Bundeswehreinsatz im Mittelmeer (SEA GUARDIAN) |
| Fraktion/Gruppe_a | SPD | SPD | SPD |
| Enthaltung_a | 0.013072 | 0.013072 | 0.006536 |
| ja_a | 0.810458 | 0.843137 | 0.869281 |
| nein_a | 0.098039 | 0.065359 | 0.039216 |
| nichtabgegeben_a | 0.078431 | 0.078431 | 0.084967 |
| ungültig_a | 0.0 | 0.0 | 0.0 |
| Fraktion/Gruppe_b | AfD | AfD | AfD |
| Enthaltung_b | 0.0 | 0.0 | 0.021739 |
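
This README does not state which metric `sim.compute_similarity` uses. To give an intuition for how two vote-share vectors can be compared, the sketch below assumes cosine similarity, which is purely an illustrative choice. The numbers are the SPD and AfD shares for "Bundeswehreinsatz gegen die Terrororganisation IS" from the outputs above; the function name `cosine_similarity` is hypothetical.

```python
import math

# Order of entries: Enthaltung, ja, nein, nichtabgegeben, ungültig
spd = [0.013072, 0.810458, 0.098039, 0.078431, 0.0]
afd = [0.0, 0.0, 0.967391, 0.032609, 0.0]


def cosine_similarity(a, b):
    """Cosine of the angle between two vote-share vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


similarity = cosine_similarity(spd, afd)
print(f"SPD vs AfD: {similarity:.3f}")
print(f"SPD vs SPD: {cosine_similarity(spd, spd):.3f}")
```

With this metric, a party compared with itself scores 1, while two parties voting mostly "ja" and mostly "nein" respectively score close to 0.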


Visualize as a time series

In [None]:
sim.plot_similarity_over_time(similarity_party_party, 
                              'Fraktion/Gruppe_b',
                              title=f'{party} vs time')

![party similarity](./README_files/party_similarity_vs_time.png)

Politicians (MdB = Mitglied des Bundestages, i.e. member of the Bundestag $\rightarrow$ `mdb`) can also be compared against party votes.

In [6]:
%%time
mdb = 'Peter Altmaier'
similarity_mdb_party = (df
                        .pipe(sim.prepare_votes_of_mdb, mdb=mdb)
                        .pipe(sim.align_mdb_with_parties, party_votes=party_votes)
                        .pipe(sim.compute_similarity, lsuffix='mdb', rsuffix='party')
                       )
similarity_mdb_party.head(3).T

CPU times: user 111 ms, sys: 19.2 ms, total: 130 ms
Wall time: 129 ms


| | 1 | 1.1 | 1.2 |
|---|---|---|---|
| date | 2012-10-18 00:00:00 | 2012-10-18 00:00:00 | 2012-10-18 00:00:00 |
| title | Gesetzentwurf 17/9852 und 17/11053 (8. Änderung des Gesetzes gegen Wettbewerbsbeschränkungen) | Gesetzentwurf 17/9852 und 17/11053 (8. Änderung des Gesetzes gegen Wettbewerbsbeschränkungen) | Gesetzentwurf 17/9852 und 17/11053 (8. Änderung des Gesetzes gegen Wettbewerbsbeschränkungen) |
| ja_mdb | 0 | 0 | 0 |
| nein_mdb | 0 | 0 | 0 |
| Enthaltung_mdb | 0 | 0 | 0 |
| ungültig_mdb | 0 | 0 | 0 |
| nichtabgegeben_mdb | 1 | 1 | 1 |
| Fraktion/Gruppe | BÜ90/GR | CDU/CSU | DIE LINKE. |
| Enthaltung_party | 0.0 | 0.0 | 0.0 |
| ja_party | 0.0 | 0.915612 | 0.0 |
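
In the output above, the single vote of the MdB appears as 0/1 indicator columns (`ja_mdb`, `nein_mdb`, ...), which puts it on the same footing as the party vote shares. A minimal sketch of such a one-hot encoding follows; `one_hot_vote` is a hypothetical helper for illustration and not necessarily how `sim.prepare_votes_of_mdb` works internally.

```python
# Vote options in the order of the `*_mdb` columns shown above.
VOTE_OPTIONS = ["ja", "nein", "Enthaltung", "ungültig", "nichtabgegeben"]


def one_hot_vote(vote, suffix="mdb"):
    """Encode a single vote as 0/1 indicator columns, one per vote option."""
    if vote not in VOTE_OPTIONS:
        raise ValueError(f"unknown vote option: {vote!r}")
    return {f"{option}_{suffix}": int(option == vote) for option in VOTE_OPTIONS}


encoded = one_hot_vote("nichtabgegeben")
print(encoded)
```

Exactly one indicator is 1 per vote, so the encoded vector sums to 1 just like a row of party vote shares.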


In [None]:
sim.plot_similarity_over_time(similarity_mdb_party, 
                              'Fraktion/Gruppe',
                              title=f'{mdb} vs time')

![mdb similarity](./README_files/mdb_similarity_vs_time.png)

**GUI to inspect similarities**

To make this exploration more convenient, the `GUI` class lets you quickly go through the different parties and politicians.

In [None]:
GUI(df).render()

### Part 2 - Predicting politician votes using abgeordnetenwatch data

The data used below was processed using `nbs/03_abgeordnetenwatch.ipynb`.

In [7]:
legislature_id = 111
aw.ABGEORDNETENWATCH_PATH = Path('./abgeordnetenwatch_data')

#### Clustering polls using Latent Dirichlet Allocation (LDA)

In [8]:
%%time
source_col = 'poll_title'
nlp_col = f'{source_col}_nlp_processed'
num_topics = 5 # number of topics / clusters to identify

st = pc.SpacyTransformer()

# load data and prepare text for modelling
df_polls_lda = (aw.get_polls_df(legislature_id)
                .assign(**{nlp_col: lambda x: st.clean_text(x, col=source_col)}))

# modelling clusters
st.fit(df_polls_lda[nlp_col].values, mode='lda', num_topics=num_topics)

# creating text features using fitted model
df_polls_lda, nlp_feature_cols = df_polls_lda.pipe(st.transform, col=nlp_col, return_new_cols=True)

# inspecting clusters
display(df_polls_lda.head(3).T)

| | 0 | 1 | 2 |
|---|---|---|---|
| poll_id | 4217 | 4215 | 4214 |
| poll_title | Änderung im Infektionsschutzgesetz | Keine Verwendung von geschlechtergerechter Sprache | Verlängerung des Bundeswehreinsatzes vor der libanesischen Küste (UNIFIL 2021/2022) |
| poll_first_committee | Ausschuss für Recht und Verbraucherschutz | | Auswärtiger Ausschuss |
| poll_description | Abgestimmt wurde über die Paragraphen 9 und 10 des Infektionsschutzgesetzes. Die AfD hatte verlangt, über einzelne Teile des Gesetzentwurfs und den Gesetzentwurf insgesamt, getrennt abzustimmen. Eine namentlicher Abstimmung fand lediglich bezüglich der Änderungen des Infektionsschutzgesetzes statt.<br>Der Gesetzentwurf wird mit 408 Ja-Stimmen der Fraktionen CDU/CSU, SPD und Bündnis 90/Die Grünen angenommen. Dagegen stimmten die FDP, Die Linke und die AfD. | Der Bundestag stimmt über einen Antrag der AfD ab, in welchem die Fraktion dazu auffordert, zugunsten einer "besseren Lesbarkeit" auf die Verwendung geschlechtergerechter Sprache durch die Bundesregierung sowie in Drucksachen des Bundestages zu verzichten. <br>Der Antrag wurde mit 531 Nein-Stimmen der Fraktionen CDU/CSU, SPD, Bündnis90/Die Grünen, Die Linke und FDP abgelehnt. Dafür stimmte lediglich die antragsstellende Fraktion der AfD. | Der von der Bundesregierung eingebrachte Antrag sieht vor, die Beteiligung der Bundeswehr am maritimen Teil der friedenssichernden Mission "United Nations Interim Force in Lebanon" (UNIFIL) zu verlängern. Bei dem Einsatz handelt es sich um die Beteiligung deutscher Streitkräfte an der Überwachung der Seegrenzen des Libanon.<br>Der Antrag wird mit 468 Ja-Stimmen der Fraktionen CDU/CSU, SPD, FDP und Bündnis 90/Die Grünen angenommen. Die Linke und die AfD stimmten gegen den Antrag. |
| legislature_id | 111 | 111 | 111 |
| legislature_period | Bundestag 2017 - 2021 | Bundestag 2017 - 2021 | Bundestag 2017 - 2021 |
| poll_date | 2021-06-24 | 2021-06-24 | 2021-06-24 |
| poll_title_nlp_processed | [Änderung, Infektionsschutzgesetz] | [Verwendung, geschlechtergerechter, Sprache] | [Verlängerung, Bundeswehreinsatzes, libanesischen, Küste, UNIFIL] |
| nlp_dim0 | 0.066793 | 0.050012 | 0.033369 |
| nlp_dim1 | 0.06668 | 0.799959 | 0.033362 |


CPU times: user 2.05 s, sys: 175 ms, total: 2.23 s
Wall time: 2.23 s


In [None]:
pc.pca_plot_lda_topics(df_polls_lda, st, source_col, nlp_feature_cols)

#### Predicting votes

Loading data

In [9]:
all_votes_path = aw.ABGEORDNETENWATCH_PATH / f'compiled_votes_legislature_{legislature_id}.csv'

# reading data frame with vote data from disk which was generated by aw.compile_votes_data
df_all_votes = pd.read_csv(all_votes_path) 

# minor pre-processing
df_all_votes = df_all_votes.assign(**{'politician name':vp.get_politician_names})

# loading info on mandates (party association) and polls (titles and descriptions)
df_mandates = aw.get_mandates_df(legislature_id)
df_mandates['party'] = df_mandates.apply(vp.get_party_from_fraction_string, axis=1)
df_polls = aw.get_polls_df(legislature_id)

Split the data set into a training and a validation set. The split is random here because it leads to an interesting result, although it is not very realistic for production.

In [10]:
splits = RandomSplitter(valid_pct=.2)(df_all_votes)
y_col = 'vote'

Training a neural net to predict `vote` based on embeddings for `poll_id` and `politician name`

In [None]:
%%time
to = TabularPandas(df_all_votes, 
                   cat_names=['politician name', 'poll_id'], # columns in `df_all_votes` to treat as categorical
                   y_names=[y_col], # column to use as a target for the model in `learn`
                   procs=[Categorify],  # processing of features
                   y_block=CategoryBlock,  # how to treat `y_names`, here as categories
                   splits=splits) # how to split the data 

dls = to.dataloaders(bs=512)
learn = tabular_learner(dls) # fastai function to set up a neural net for tabular data
lrs = learn.lr_find() # searches for a suitable learning rate
learn.fit_one_cycle(5, lrs.valley) # performs training using one-cycle hyperparameter schedule

**Predictions over unseen data**

Inspecting the predictions of the neural net over the validation set. 

In [12]:
vp.plot_predictions(learn, df_all_votes, df_mandates, df_polls, splits,
                    n_worst_politicians=5)

| vote \ vote_pred | abstain | no | no_show | yes |
|---|---|---|---|---|
| abstain | 803 | 50 | 58 | 64 |
| no | 24 | 9405 | 190 | 138 |
| no_show | 103 | 978 | 546 | 917 |
| yes | 15 | 68 | 168 | 10954 |


| vote \ vote_pred | abstain | no | no_show | yes |
|---|---|---|---|---|
| abstain | 0.82359 | 0.051282 | 0.059487 | 0.065641 |
| no | 0.00246 | 0.963923 | 0.019473 | 0.014144 |
| no_show | 0.040487 | 0.384434 | 0.214623 | 0.360456 |
| yes | 0.001339 | 0.006069 | 0.014993 | 0.977599 |


2021-08-25 16:36:02.574 | INFO     | bundestag.vote_prediction:plot_predictions:100 - Overall accuracy = 88.67 %
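
The second table and the logged accuracy follow directly from the counts in the first table: each row is divided by its row sum, and the accuracy is the diagonal sum divided by the total. The short sketch below reproduces this arithmetic from the counts shown above; it is a worked example, not the code inside `vp.plot_predictions`.

```python
labels = ["abstain", "no", "no_show", "yes"]

# Counts copied from the first table above: rows = actual vote, columns = predicted vote.
counts = [
    [803, 50, 58, 64],
    [24, 9405, 190, 138],
    [103, 978, 546, 917],
    [15, 68, 168, 10954],
]

# Divide each row by its sum to get the share of predictions per actual vote.
row_normalised = [[cell / sum(row) for cell in row] for row in counts]

# Overall accuracy = correctly predicted votes (diagonal) / all votes.
correct = sum(counts[i][i] for i in range(len(labels)))
total = sum(sum(row) for row in counts)
accuracy = correct / total

for label, row in zip(labels, row_normalised):
    print(label, [round(value, 6) for value in row])
print(f"Overall accuracy = {accuracy:.2%}")  # Overall accuracy = 88.67%
```

The `no_show` row stands out: only about 21 % of actual no-shows are predicted correctly, while `yes` and `no` are recovered in more than 96 % of cases.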



5 most inaccurately predicted politicians:


| | politician name | party | prediction_correct |
|---|---|---|---|
| 0 | Mario Mieruch | fraktionslos | 0.333333 |
| 1 | Heiko Heßenkemper | fraktionslos | 0.45 |
| 2 | Thomas Nord | DIE LINKE | 0.463415 |
| 3 | Axel Troost | DIE LINKE | 0.5 |
| 4 | Katarina Barley | SPD | 0.5 |



5 most inaccurately predicted polls:


| | poll_id | poll_title | prediction_correct |
|---|---|---|---|
| 0 | 1761 | Organspenden-Reform: Zustimmungslösung | 0.59375 |
| 1 | 1758 | Organspenden-Reform: Widerspruchslösung | 0.596899 |
| 2 | 3572 | Corona-Maßnahmen: Aussetzung der Schuldenbremse - erster Nachtragshaushalt | 0.726708 |
| 3 | 1683 | BDS-Bewegung entgegentreten - Antisemitismus bekämpfen | 0.745342 |
| 4 | 3571 | Fortsetzung des Bundeswehreinsatzes in Afghanistan | 0.753623 |


Splitting our dataset randomly leads to a surprisingly good accuracy of about 88.7% over the validation set. The most reasonable explanation is that the model has already encountered each poll, and how most politicians voted in it, during training.

This can be interpreted as follows: if it is known how most politicians vote in a poll, then the votes of the remaining politicians are highly predictable. Splitting the data set by `poll_id`, as can be done using `vp.poll_splitter`, leads to predictions no better than random chance. Anything else would be surprising, since the only information provided to the model is who is voting.
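
To make the difference between the two splitting strategies concrete, here is a minimal sketch of a poll-wise split in plain Python. `split_by_poll` is a hypothetical helper written for this illustration; it is not the implementation of `vp.poll_splitter`, and the toy `poll_ids` are made up.

```python
import random


def split_by_poll(poll_ids, valid_pct=0.2, seed=42):
    """Hypothetical helper: hold out whole polls instead of single votes."""
    unique_polls = sorted(set(poll_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_polls)

    # Reserve at least one complete poll for validation.
    n_valid = max(1, int(len(unique_polls) * valid_pct))
    held_out = set(unique_polls[:n_valid])

    train_idx = [i for i, poll in enumerate(poll_ids) if poll not in held_out]
    valid_idx = [i for i, poll in enumerate(poll_ids) if poll in held_out]
    return train_idx, valid_idx


# Toy example: 5 polls with 4 votes each (one row per politician and poll).
poll_ids = [poll for poll in range(5) for _ in range(4)]
train_idx, valid_idx = split_by_poll(poll_ids)

train_polls = {poll_ids[i] for i in train_idx}
valid_polls = {poll_ids[i] for i in valid_idx}
print(train_polls & valid_polls)  # empty set: no poll appears on both sides
```

With a random row-wise split, the same `poll_id` typically ends up in both sets, so the model can learn a poll embedding from the training rows and reuse it for validation. With a poll-wise split, the validation polls were never seen during training.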

**Visualising learned embeddings**

Besides the predictions themselves, it is also interesting to inspect what the model actually learned. This can sometimes lead to [surprises](https://github.com/entron/entity-embedding-rossmann).

So let's look at the learned embeddings.

In [13]:
embeddings = vp.get_embeddings(learn)

To make sense of the embeddings for `poll_id` as well as `politician name`, we apply Principal Component Analysis (a linear projection, so distances remain roughly interpretable) and project down to 2D.

Each poll is color coded by its strongest proponent, i.e. the party with the highest share of "yes" votes in that poll.

In [None]:
vp.plot_poll_embeddings(df_all_votes, df_polls, embeddings, df_mandates=df_mandates)

![poll embeddings](./README_files/poll_embeddings.png)

The politician embeddings are color coded by the politician's party membership.

In [None]:
vp.plot_politician_embeddings(df_all_votes, df_mandates, embeddings)

![mandate embeddings](./README_files/mandate_embeddings.png)

The politician embeddings may be the most surprising finding in their clarity. For both polls and politicians we seem to find 2-3 clusters; for politicians, there is a pronounced grouping of the mandates associated with the government coalition. In other words, there appears to be one cluster for the government parties and one for the opposition.

## To dos / contributing

Any contributions are welcome. In the notebooks in `./nbs/` I've listed to-dos here and there for things that could be done.

**General to dos**:
- Check for discrepancies between bundestag.de and abgeordnetenwatch based data 
- Make the clustering of polls and politicians interactive
- Extend the vote prediction model: currently, if the data is split by poll (which would be the realistic case when trying to predict votes of a new poll), the model is hardly better than chance. It would be interesting to see which information would help improve beyond chance.
- Extend the data processed from the stored json responses from abgeordnetenwatch (currently only using the bare minimum)