# "Namentliche Abstimmungen" in the Bundestag

> Parse and inspect "Namentliche Abstimmungen" (roll call votes) in the Bundestag (the federal German parliament)

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/eschmidt42/bundestag/HEAD)

## Context

The German parliament kindly publishes the votes of all its members as readable XLSX / XLS files (and PDFs ¯\\\_(ツ)\_/¯ ). Those files can be found here: https://www.bundestag.de/parlament/plenum/abstimmung/liste.

Furthermore, the organisation [abgeordnetenwatch](https://www.abgeordnetenwatch.de/) offers a great platform to get to know the individual politicians and their behavior as well as an [open API](https://www.abgeordnetenwatch.de/api) to request data.

## Purpose of this repo

The purpose of this repo is to help collect roll call votes, either directly from the parliament's site or via abgeordnetenwatch's API, and make them available for analysis / modelling. This may be particularly interesting for the upcoming election in 2021. For example, this dataset lets you see how your local member of parliament voted in public roll call votes relative to the parties, or how closely individual parties agree in their votes.

Since the files on the bundestag website are stored in a way that makes them tricky to crawl automatically, a bit of manual work is required to generate that dataset. But don't fret! Quite a few recent roll call votes (as of the publishing of this repo) are already prepared for you. If older or more recent roll call votes are missing, the convenience tools demonstrated below reduce your manual effort. An alternative route to the same data and more (on politicians and local parliaments as well) is abgeordnetenwatch.

For your inspiration, I have also included an analysis of how similarly the parties voted and how similar the votes of individual MdBs (members of the Bundestag) are to those of the parties, as well as a small machine learning model that predicts the individual votes of members of parliament. Teaser: the "Fraktionszwang" (party-line voting discipline) seems to exist but is not absolute, and the data shows it 😁.

## How to install

`pip install bundestag`

## How to use

### Docs

For detailed explanations see:
- parse data from bundestag.de $\rightarrow$ `nbs/00_html_parsing.ipynb`
- parse data from abgeordnetenwatch.de $\rightarrow$ `nbs/03_abgeordnetenwatch.ipynb`
- analyze party / abgeordneten similarity $\rightarrow$ `nbs/01_similarities.ipynb`
- cluster polls $\rightarrow$ `nbs/04_poll_clustering.ipynb`
- predict politician votes $\rightarrow$ `nbs/05_predicting_votes.ipynb`

For a short overview of the highlights see below.

### When developing

Create the virtual environment
```shell
python3 -m venv .venv
source .venv/bin/activate
pip install pip-tools==6.12.3
```

To update the requirements
```shell
pip-compile -o requirements/requirements.txt pyproject.toml --resolver=backtracking
pip-compile --extra dev -o requirements/dev-requirements.txt pyproject.toml  --resolver=backtracking
```

To install the dev requirements
```shell
pip-sync requirements/dev-requirements.txt
```

To make the package available
```shell
pip install -e .
```

### Setup

In [1]:
%load_ext autoreload
%autoreload 2

In [None]:
from bundestag import html_parsing as hp
from bundestag import similarity as sim
from bundestag.gui import MdBGUI, PartyGUI
from bundestag import abgeordnetenwatch as aw
from bundestag import poll_clustering as pc
from bundestag import vote_prediction as vp

from pathlib import Path
import pandas as pd
from fastai.tabular.all import *

### Part 1 - Party/Party similarities and Politician/Party similarities using bundestag.de data

**Loading the data**

If you have cloned the repo, you should already have a `bundestag.de_votes.parquet` file in the root directory of the repo. If not, feel free to download that file directly.

If you want to have a closer look at the preprocessing please check out `nbs/00_html_parsing.ipynb`.

In [3]:
df = pd.read_parquet(path='bundestag.de_votes.parquet')
df.head(3).T

| | 0 | 1 | 2 |
|---|---|---|---|
| Wahlperiode | 17 | 17 | 17 |
| Sitzungnr | 198 | 198 | 198 |
| Abstimmnr | 1 | 1 | 1 |
| Fraktion/Gruppe | CDU/CSU | CDU/CSU | CDU/CSU |
| Name | Aigner | Altmaier | Aumer |
| Vorname | Ilse | Peter | Peter |
| Titel | | | |
| Bezeichnung | Ilse Aigner | Peter Altmaier | Peter Aumer |
| sheet_name | T_Export | T_Export | T_Export |
| date | 2012-10-18 00:00:00 | 2012-10-18 00:00:00 | 2012-10-18 00:00:00 |


Votes by party

In [4]:
%%time
party_votes = sim.get_votes_by_party(df)
sim.test_party_votes(party_votes)

2021-08-27 06:53:37.585 | INFO     | bundestag.similarity:get_votes_by_party:17 - Computing votes by party and poll


CPU times: user 5.4 s, sys: 0 ns, total: 5.4 s
Wall time: 5.38 s
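
To make the aggregation concrete, here is a minimal sketch of how vote shares per party and poll could be computed. It is not the code behind `sim.get_votes_by_party`; the function `vote_shares_by_party`, the tuple layout and the toy records are hypothetical, and only the vote categories are taken from the data shown in this README.

```python
from collections import Counter, defaultdict

# Vote categories as they appear in the bundestag.de data
VOTE_COLS = ["ja", "nein", "Enthaltung", "ungültig", "nichtabgegeben"]


def vote_shares_by_party(records):
    """Fraction of each vote category per (poll, party).

    `records` is an iterable of (poll, party, vote) tuples
    (a hypothetical layout chosen for this sketch).
    """
    counts = defaultdict(Counter)
    for poll, party, vote in records:
        counts[(poll, party)][vote] += 1

    shares = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        # Categories nobody chose get a share of 0.0
        shares[key] = {col: counter[col] / total for col in VOTE_COLS}
    return shares


# Toy data, made up for illustration
records = [
    ("poll A", "Party X", "ja"),
    ("poll A", "Party X", "ja"),
    ("poll A", "Party X", "nein"),
    ("poll A", "Party X", "nichtabgegeben"),
    ("poll A", "Party Y", "nein"),
]
shares = vote_shares_by_party(records)
print(shares[("poll A", "Party X")])
```

Each (poll, party) pair ends up with five fractions that sum to 1, which is the shape of the rows in the pivoted table further down.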


Re-arranging `party_votes`

In [5]:
%%time
party_votes_pivoted = sim.pivot_party_votes_df(party_votes)
sim.test_party_votes_pivoted(party_votes_pivoted)
party_votes_pivoted.head()

CPU times: user 19 s, sys: 504 ms, total: 19.5 s
Wall time: 19.5 s


| date | title | Fraktion/Gruppe | ja | nein | Enthaltung | ungültig | nichtabgegeben |
|---|---|---|---|---|---|---|---|
| 2017-12-12 | Bundeswehreinsatz gegen die Terrororganisation IS | AfD | 0.0 | 0.967391 | 0.0 | 0 | 0.032609 |
| 2017-12-12 | Bundeswehreinsatz im Irak | AfD | 0.0 | 0.978261 | 0.0 | 0 | 0.021739 |
| 2017-12-12 | Bundeswehreinsatz im Mittelmeer (SEA GUARDIAN) | AfD | 0.913043 | 0.021739 | 0.021739 | 0 | 0.043478 |
| 2017-12-12 | Bundeswehreinsatz in Afghanistan (Resolute Support) | AfD | 0.0 | 0.956522 | 0.01087 | 0 | 0.032609 |
| 2017-12-12 | Bundeswehreinsatz in Mali (MINUSMA) | AfD | 0.0 | 0.967391 | 0.0 | 0 | 0.032609 |


**Similarity of a single politician with the parties**

Collecting the politician's votes

In [6]:
%%time
mdb = 'Peter Altmaier'
mdb_votes = sim.prepare_votes_of_mdb(df, mdb)
sim.test_votes_of_mdb(mdb_votes)
mdb_votes.head()

CPU times: user 62.9 ms, sys: 249 µs, total: 63.2 ms
Wall time: 61.7 ms


| | date | title | ja | nein | Enthaltung | ungültig | nichtabgegeben |
|---|---|---|---|---|---|---|---|
| 1 | 2012-10-18 | Gesetzentwurf 17/9852 und 17/11053 (8. Änderung des Gesetzes gegen Wettbewerbsbeschränkungen) | 0 | 0 | 0 | 0 | 1 |
| 621 | 2012-10-25 | 17/10059 und 17/11093, Abkommen zwischen Deutschland und der Schweiz | 1 | 0 | 0 | 0 | 0 |
| 1241 | 2012-10-25 | 17/11172, Änderungsantrag zum Gesetzentwurf zur Stärkung der deutschen Finanzaufsicht | 0 | 1 | 0 | 0 | 0 |
| 1861 | 2012-10-25 | 17/11193, Änderungsantrag zum Jahressteuergesetz 2013 | 0 | 0 | 0 | 0 | 1 |
| 2481 | 2012-10-25 | 17/11196, Änderungsantrag zum Jahressteuergesetz 2013 | 0 | 0 | 0 | 0 | 1 |
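
The indicator columns above can be read as a one-hot encoding of each vote: exactly one of the five columns is 1 per row. A minimal sketch of that encoding follows, assuming the raw vote is available as a single category label. The helper `one_hot_vote` is hypothetical and not part of the `bundestag` package.

```python
# Vote categories as they appear in the table above
VOTE_COLS = ["ja", "nein", "Enthaltung", "ungültig", "nichtabgegeben"]


def one_hot_vote(vote):
    """Turn a single vote label into five 0/1 indicator values."""
    if vote not in VOTE_COLS:
        raise ValueError(f"unknown vote category: {vote!r}")
    return {col: int(col == vote) for col in VOTE_COLS}


print(one_hot_vote("nichtabgegeben"))
```

Encoding the votes this way puts a politician's vote in the same five-dimensional space as a party's vote shares, which is what makes the comparison in the next step possible.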


Comparing the politician against the parties

In [7]:
%%time
mdb_vs_parties = (sim.align_mdb_with_parties(mdb_votes, party_votes_pivoted)
                  .pipe(sim.compute_similarity, lsuffix='mdb', rsuffix='party'))
sim.test_mdb_vs_parties(mdb_vs_parties)
mdb_vs_parties.head(3).T

2021-08-27 06:54:02.682 | INFO     | bundestag.similarity:compute_similarity:110 - Computing similarities using `lsuffix` = "mdb", `rsuffix` = "party" and metric = <function cosine_similarity at 0x7fb7e220e0d0>


CPU times: user 81.3 ms, sys: 1.03 ms, total: 82.4 ms
Wall time: 77.5 ms


| | 1 | 1 | 1 |
|---|---|---|---|
| date | 2012-10-18 00:00:00 | 2012-10-18 00:00:00 | 2012-10-18 00:00:00 |
| title | Gesetzentwurf 17/9852 und 17/11053 (8. Änderung des Gesetzes gegen Wettbewerbsbeschränkungen) | Gesetzentwurf 17/9852 und 17/11053 (8. Änderung des Gesetzes gegen Wettbewerbsbeschränkungen) | Gesetzentwurf 17/9852 und 17/11053 (8. Änderung des Gesetzes gegen Wettbewerbsbeschränkungen) |
| ja_mdb | 0 | 0 | 0 |
| nein_mdb | 0 | 0 | 0 |
| Enthaltung_mdb | 0 | 0 | 0 |
| ungültig_mdb | 0 | 0 | 0 |
| nichtabgegeben_mdb | 1 | 1 | 1 |
| Fraktion/Gruppe | BÜ90/GR | CDU/CSU | DIE LINKE. |
| ja_party | 0.0 | 0.915612 | 0.0 |
| nein_party | 0.867647 | 0.0 | 0.789474 |


Plotting

In [None]:
sim.plot(mdb_vs_parties, title_overall=f'Overall similarity of {mdb} with all parties',
         title_over_time=f'{mdb} vs time')
plt.tight_layout()
plt.show()

![mdb similarity](./README_files/mdb_similarity_vs_time.png)

**Comparing one specific party against all others**

Collecting party votes

In [8]:
%%time
party = 'SPD'
partyA_vs_rest = (sim.align_party_with_all_parties(party_votes_pivoted, party)
                  .pipe(sim.compute_similarity, lsuffix='a', rsuffix='b'))
sim.test_partyA_vs_partyB(partyA_vs_rest)
partyA_vs_rest.head(3).T

2021-08-27 06:54:02.842 | INFO     | bundestag.similarity:compute_similarity:110 - Computing similarities using `lsuffix` = "a", `rsuffix` = "b" and metric = <function cosine_similarity at 0x7fb7e220e0d0>


CPU times: user 119 ms, sys: 768 µs, total: 120 ms
Wall time: 109 ms


| | 273 | 274 | 275 |
|---|---|---|---|
| date | 2017-12-12 00:00:00 | 2017-12-12 00:00:00 | 2017-12-12 00:00:00 |
| title | Bundeswehreinsatz gegen die Terrororganisation IS | Bundeswehreinsatz im Irak | Bundeswehreinsatz im Mittelmeer (SEA GUARDIAN) |
| Fraktion/Gruppe_a | SPD | SPD | SPD |
| ja_a | 0.810458 | 0.843137 | 0.869281 |
| nein_a | 0.098039 | 0.065359 | 0.039216 |
| Enthaltung_a | 0.013072 | 0.013072 | 0.006536 |
| ungültig_a | 0 | 0 | 0 |
| nichtabgegeben_a | 0.078431 | 0.078431 | 0.084967 |
| Fraktion/Gruppe_b | AfD | AfD | AfD |
| ja_b | 0.0 | 0.0 | 0.913043 |
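
The log output above names `cosine_similarity` as the metric. As a sketch of what that metric measures, here it is written from scratch (this is not the function the package calls) and applied to the SPD and AfD vote shares for "Bundeswehreinsatz gegen die Terrororganisation IS" as they appear in the tables of this README. Returning 0.0 for all-zero vectors is a convention chosen for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Order: ja, nein, Enthaltung, ungültig, nichtabgegeben
spd = [0.810458, 0.098039, 0.013072, 0.0, 0.078431]
afd = [0.0, 0.967391, 0.0, 0.0, 0.032609]

similarity = cosine_similarity(spd, afd)
print(f"SPD vs AfD: {similarity:.2f}")
```

For this poll the value comes out low (about 0.12), because the SPD mostly voted "ja" while the AfD mostly voted "nein". Two parties with identical vote shares would score 1.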


Plotting

In [None]:
sim.plot(partyA_vs_rest, title_overall=f'Overall similarity of {party} with all parties',
         title_over_time=f'{party} vs time', party_col='Fraktion/Gruppe_b')
plt.tight_layout()
plt.show()

![party similarity](./README_files/party_similarity_vs_time.png)

**GUI to inspect similarities**

To make the above exploration more interactive, the classes `MdBGUI` and `PartyGUI` were implemented to quickly go through the different parties and politicians.

In [None]:
mdb = MdBGUI(df)

In [None]:
mdb.render()

In [None]:
party = PartyGUI(df)

In [None]:
party.render()

### Part 2 - predicting politician votes using abgeordnetenwatch data

The data used below was processed using `nbs/03_abgeordnetenwatch.ipynb`.

In [9]:
path = Path('./abgeordnetenwatch_data')

#### Clustering polls using Latent Dirichlet Allocation (LDA)

In [10]:
%%time
source_col = 'poll_title'
nlp_col = f'{source_col}_nlp_processed'
num_topics = 5 # number of topics / clusters to identify

st = pc.SpacyTransformer()

# load data and prepare text for modelling
df_polls_lda = (pd.read_parquet(path=path/'df_polls.parquet')
                .assign(**{nlp_col: lambda x: st.clean_text(x, col=source_col)}))

# modelling clusters
st.fit(df_polls_lda[nlp_col].values, mode='lda', num_topics=num_topics)

# creating text features using fitted model
df_polls_lda, nlp_feature_cols = df_polls_lda.pipe(st.transform, col=nlp_col, return_new_cols=True)

# inspecting clusters
display(df_polls_lda.head(3).T)

| | 0 | 1 | 2 |
|---|---|---|---|
| poll_id | 4217 | 4215 | 4214 |
| poll_title | Änderung im Infektions­schutz­gesetz | Keine Verwendung von geschlechtergerechter Sprache | Verlängerung des Bundeswehreinsatzes vor der libanesischen Küste (UNIFIL 2021/2022) |
| poll_first_committee | Ausschuss für Recht und Verbraucherschutz | | Auswärtiger Ausschuss |
| poll_description | Abgestimmt wurde über die Paragraphen 9 und 10 des Infektionsschutzgesetzes. Die AfD hatte verlangt, über einzelne Teile des Gesetzentwurfs und den Gesetzentwurf insgesamt, getrennt abzustimmen. Eine namentlicher Abstimmung fand lediglich bezüglich der Änderungen des Infektionsschutzgesetzes statt.\nDer Gesetzentwurf wird mit 408 Ja-Stimmen der Fraktionen CDU/CSU, SPD und Bündnis 90/Die Grünen angenommen. Dagegen stimmten die FDP, Die Linke und die AfD. | Der Bundestag stimmt über einen Antrag der AfD ab, in welchem die Fraktion dazu auffordert, zugunsten einer "besseren Lesbarkeit" auf die Verwendung geschlechtergerechter Sprache durch die Bundesregierung sowie in Drucksachen des Bundestages zu verzichten. \nDer Antrag wurde mit 531 Nein-Stimmen der Fraktionen CDU/CSU, SPD, Bündnis90/Die Grünen, Die Linke und FDP abgelehnt. Dafür stimmte lediglich die antragsstellende Fraktion der AfD. | Der von der Bundesregierung eingebrachte Antrag sieht vor, die Beteiligung der Bundeswehr am maritimen Teil der friedenssichernden Mission "United Nations Interim Force in Lebanon" (UNIFIL) zu verlängern. Bei dem Einsatz handelt es sich um die Beteiligung deutscher Streitkräfte an der Überwachung der Seegrenzen des Libanon.\nDer Antrag wird mit 468 Ja-Stimmen der Fraktionen CDU/CSU, SPD, FDP und Bündnis 90/Die Grünen angenommen. Die Linke und die AfD stimmten gegen den Antrag. |
| legislature_id | 111 | 111 | 111 |
| legislature_period | Bundestag 2017 - 2021 | Bundestag 2017 - 2021 | Bundestag 2017 - 2021 |
| poll_date | 2021-06-24 | 2021-06-24 | 2021-06-24 |
| poll_title_nlp_processed | [Änderung, Infektions­schutz­gesetz] | [Verwendung, geschlechtergerechter, Sprache] | [Verlängerung, Bundeswehreinsatzes, libanesischen, Küste, UNIFIL] |
| nlp_dim0 | 0.730206 | 0.050009 | 0.034305 |
| nlp_dim1 | 0.067006 | 0.050014 | 0.03356 |


CPU times: user 1.83 s, sys: 111 ms, total: 1.94 s
Wall time: 1.94 s
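
The `poll_title_nlp_processed` values above suggest that each title is reduced to its content words before the topic model is fitted. The following is a rough stand-in for that step, not the logic of `st.clean_text`: it uses a regular expression and a tiny hand-written stop word list (both assumptions made for this sketch), and the function `clean_title` is hypothetical.

```python
import re

# Tiny hand-written stop word list, just enough for the example titles
STOPWORDS = {"des", "vor", "der", "im", "von", "keine"}


def clean_title(title):
    """Keep alphabetic tokens that are not stop words."""
    # [^\W\d_]+ matches runs of letters only (no digits, no underscore)
    tokens = re.findall(r"[^\W\d_]+", title)
    return [t for t in tokens if t.lower() not in STOPWORDS]


title = "Verlängerung des Bundeswehreinsatzes vor der libanesischen Küste (UNIFIL 2021/2022)"
print(clean_title(title))
```

A real pipeline would more likely rely on a language model's tokenizer and stop word list rather than a hard-coded set, but the result has the same shape: one list of tokens per poll title.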


In [None]:
pc.pca_plot_lda_topics(df_polls_lda, st, source_col, nlp_feature_cols)

#### Predicting votes

Loading data

In [11]:
df_all_votes = pd.read_parquet(path=path / 'df_all_votes.parquet')
df_mandates = pd.read_parquet(path=path / 'df_mandates.parquet')
df_polls = pd.read_parquet(path=path / 'df_polls.parquet')

Splitting the data set into a training and a validation set. The split is random here because it leads to an interesting result, although it is not very realistic for production.

In [12]:
splits = RandomSplitter(valid_pct=.2)(df_all_votes)
y_col = 'vote'

Training a neural net to predict `vote` based on embeddings for `poll_id` and `politician name`

In [None]:
%%time
to = TabularPandas(df_all_votes, 
                   cat_names=['politician name', 'poll_id'], # columns in `df_all_votes` to treat as categorical
                   y_names=[y_col], # column to use as a target for the model in `learn`
                   procs=[Categorify],  # processing of features
                   y_block=CategoryBlock,  # how to treat `y_names`, here as categories
                   splits=splits) # how to split the data 

dls = to.dataloaders(bs=512)
learn = tabular_learner(dls) # fastai function to set up a neural net for tabular data
lrs = learn.lr_find() # searches the learning rate
learn.fit_one_cycle(5, lrs.valley) # performs training using one-cycle hyperparameter schedule

**Predictions over unseen data**

Inspecting the predictions of the neural net over the validation set. 

In [None]:
vp.plot_predictions(learn, df_all_votes, df_mandates, df_polls, splits,
                    n_worst_politicians=5)

Splitting our dataset randomly leads to a surprisingly good accuracy of ~88% over the validation set. The most reasonable explanation is that, during training, the model already saw each poll and how most politicians voted in it.

In other words, if it is known how most politicians vote in a poll, the votes of the remaining politicians are highly predictable. Splitting the data set by `poll_id`, as can be done using `vp.poll_splitter`, leads to predictions no better than random chance. Anything else would be surprising, since the only information then available to the model is who is voting.
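
To illustrate the difference between the two splitting strategies, here is a minimal sketch of a split that holds out whole polls. It is not the implementation of `vp.poll_splitter`; the function `split_by_poll` and the toy poll ids are hypothetical.

```python
import random


def split_by_poll(poll_ids, valid_pct=0.2, seed=0):
    """Return (train_idx, valid_idx) so that no poll appears in both sets.

    `poll_ids` holds one poll id per row of the votes table.
    """
    unique_polls = sorted(set(poll_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_polls)

    # Hold out at least one poll for validation
    n_valid = max(1, round(valid_pct * len(unique_polls)))
    valid_polls = set(unique_polls[:n_valid])

    train_idx = [i for i, p in enumerate(poll_ids) if p not in valid_polls]
    valid_idx = [i for i, p in enumerate(poll_ids) if p in valid_polls]
    return train_idx, valid_idx


# Toy data: five polls with two votes each
poll_ids = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
train_idx, valid_idx = split_by_poll(poll_ids)
print(train_idx, valid_idx)
```

With a random row-wise split, most validation rows belong to polls the model has already seen. With a split like this one, every validation poll is new, so a `poll_id` embedding learned during training carries no information about it.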

**Visualising learned embeddings**

Besides the actual predictions, it is also interesting to inspect what the model actually learned. This can sometimes lead to [surprises](https://github.com/entron/entity-embedding-rossmann).

So let's look at the learned embeddings

In [14]:
embeddings = vp.get_embeddings(learn)

To make sense of the embeddings for `poll_id` as well as `politician name`, we apply Principal Component Analysis (a linear projection, so distances remain roughly interpretable) and project down to 2d.

We color code the individual polls by their strongest proponent, i.e. the party with the highest share of "yes" votes in that poll.
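
As a sketch of how such a label could be derived (this is not the code used inside `vp.plot_poll_embeddings`), the following picks, for each poll, the party with the highest share of "yes" votes. The function `strongest_proponent`, the tuple layout and the vote label `"yes"` are assumptions made for this example.

```python
from collections import defaultdict


def strongest_proponent(votes):
    """Map each poll to the party with the highest share of "yes" votes.

    `votes` is an iterable of (poll_id, party, vote) tuples (hypothetical layout).
    Ties are resolved in favour of the party encountered first.
    """
    yes = defaultdict(int)
    total = defaultdict(int)
    for poll_id, party, vote in votes:
        total[(poll_id, party)] += 1
        yes[(poll_id, party)] += int(vote == "yes")

    best = {}  # poll_id -> (party, yes share)
    for (poll_id, party), n in total.items():
        share = yes[(poll_id, party)] / n
        if poll_id not in best or share > best[poll_id][1]:
            best[poll_id] = (party, share)
    return {poll_id: party for poll_id, (party, _) in best.items()}


# Toy data, made up for illustration
votes = [
    (1, "Party A", "yes"), (1, "Party A", "no"),
    (1, "Party B", "yes"), (1, "Party B", "yes"),
    (2, "Party A", "yes"), (2, "Party B", "no"),
]
proponents = strongest_proponent(votes)
print(proponents)
```

Using the share rather than the absolute number of "yes" votes keeps large parties from being labelled the strongest proponent of every poll.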

In [None]:
vp.plot_poll_embeddings(df_all_votes, df_polls, embeddings, df_mandates=df_mandates)

![poll embeddings](./README_files/poll_embeddings.png)

The politician embeddings are color coded using the politician's party membership

In [None]:
vp.plot_politician_embeddings(df_all_votes, df_mandates, embeddings)

![mandate embeddings](./README_files/mandate_embeddings.png)

The politician embeddings may be the most surprising finding in their clarity. For both polls and politicians we seem to find 2-3 clusters, and for politicians there is a pronounced grouping of the mandates associated with the government coalition. It seems we find one cluster for the government parties and one for the opposition.

## To dos / contributing

Any contributions are welcome. In the notebooks in `./nbs/` I've listed, here and there, to dos for things that could be done.

**General to dos**:
- Check for discrepancies between bundestag.de and abgeordnetenwatch based data 
- Make the clustering of polls and politicians interactive
- Extend the vote prediction model: currently, if the data is split by poll (which would be the realistic case when trying to predict votes of a new poll), the model is hardly better than chance. It would be interesting to see which information would help improve beyond chance.
- Extend the data processed from the stored json responses from abgeordnetenwatch (currently only using the bare minimum)