# Project 5: Data Science & Machine Learning on Votings of the Swiss National Council

In project 5, we will analyze the voting behavior of the "Nationalrat" of the Swiss parliament in a number of ways. The project consists of 3 different files:

* Data Preparation (this notebook): Prepare the data for the other two notebooks.
* Predictions: Predict the voting behavior of individual members or the entire council.
* Unsupervised: Find lower-dimensional representations of the voting behavior and groups of members of parliament.

# Data Preparation

In this notebook, we clean and reformat the raw data on the votings of the national council analysis such that it will be ready for the other two notebooks. 

In most of the project (i.e., all except one part of the unsupervised learning notebook), we will consider the voting proposals as observational units, and the votes by the members of parliament as variables. With this understanding, additional information about the proposals are also variables added as a column. However, in a strict interpretation, this might not be fully compliant with the ideas of a tidy dataset, but it serves purpose as a joint basis for the supervised and unsupervised learning notebook. In the supervised learning notebook, we will in fact further transform the notebook into a fully tidy representation.

**You have to run this notebook before you can work on the other two.**

**To avoid potential issues with memory limitations (which might result in the kernel dying), we recomment that you click "Close and Shut Down Notebooks" (in the "File" tab) before you start another notebook.**

## Getting the Data
The voting behavior for every member of parliament as well as some information about the subjects of the vote are publicly available from https://www.parlament.ch/de/ratsbetrieb/abstimmungen/abstimmung-nr-xls (though only in German, French and Italian). We have downloaded the data for the summer sessions of the last four years. We will run the notebook on the latest data (i.e., from the year 2024) - but you are of course free to change to a different year, and you can also download further data files, namely from earlier years or from the council of states ("Ständerat").

In [204]:
import pandas as pd
import numpy as np

The data is given as Excel sheets; we load it using `read_excel` from the `pandas` package. It might be worth to open the file in Excel to see how it's structured.

We see that the first two rows have a different format than the rest and do not contain substantial information. We therefore skip these two rows to avoid problems with the parsing of the file content:

In [205]:
file_path = 'Abstimmungen_NR_2024SS_DE.xlsx'
df_votings_raw = pd.read_excel(file_path, header = None, skiprows=range(2))

## Format Headers and Columns
Let's have a look at the data we have just loaded:

In [206]:
df_votings_raw.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,209,210,211,212,213,214,215,216,217,218
0,,,,,,,,,,,...,10851,4179,10822,,,,,,,
1,,,,,,,,,,,...,"Zryd, Andrea","Zuberbühler, David","Zybach, Ursula",,,,,,,
2,,,,,,,,,,,...,NR,NR,NR,,,,,,,
3,,,,,,,,,,,...,S,V,S,,,,,,,
4,,,,,,,,,,,...,BE,AR,BE,,,,,,,
5,,,,,,,,,,,...,24.10.1975,20.02.1979,29.08.1967,,,,,,,
6,,,,,,,,,,,...,04.12.2023,04.12.2023,04.12.2023,,,,,,,
7,Abstimmungsdatum,Rat,Zuständige Kommission,Zuständige Behörde,Geschäftsnummer,Geschäftstitel,Referenznummer,Bedeutung Ja,Bedeutung Nein,Abstimmungsgegenstand,...,,,,Entscheid des Rates,Anzahl 'Ja',Anzahl 'Nein',Anzahl Enthaltungen,Anzahl 'entschuldigt',Anzahl 'nicht teilgenommen',Teilnahme Präsident/in an der Abstimmung
8,29.05.2024,NR,FK-NR | FK-SR | N/A-D-V | WBK-NR | WBK-SR,WBF,20240031,"Förderung von Bildung, Forschung und Innovatio...",28659,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Gesamtabstimmung | * | *,...,Enthaltung,Ja,Ja,Ja,145,48,3,3,0,Hat nicht teilgenommen
9,27.05.2024,NR,N/A-D-V | SPK-NR | SPK-SR,EJPD,20210504,Bei häuslicher Gewalt die Härtefallpraxis nach...,28789,Antrag der Mehrheit | * | *,Antrag der Minderheit Steinemann (gemäss SR) |...,Art. 50 Abs. 2 Bst. a Ziff. 2 | * | *,...,Ja,Nein,Ja,Ja,126,62,0,3,8,Hat nicht teilgenommen


In [207]:
df_votings_raw.tail()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,209,210,211,212,213,214,215,216,217,218
364,14.06.2024,NR,N/A-D-V | WBK-NR | WBK-SR,EJPD,20230070,Austausch von Daten betreffend gesperrte Spiel...,29249,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Schlussabstimmung | * | *,...,Ja,Ja,Ja,Ja,195,3,0,1,0,Hat nicht teilgenommen
365,14.06.2024,NR,N/A-D-V | WAK-NR | WAK-SR,EFD,20230077,Abkommen zwischen der Schweiz und Slowenien zu...,29250,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Schlussabstimmung | * | *,...,Ja,Nein,Ja,Ja,145,45,7,1,1,Hat nicht teilgenommen
366,14.06.2024,NR,N/A-D-V | WAK-NR | WAK-SR,EFD,20230080,Zusatzabkommen zum Abkommen vom 9. September 1...,29251,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Schlussabstimmung | * | *,...,Ja,Ja,Ja,Ja,196,1,1,1,0,Hat nicht teilgenommen
367,14.06.2024,NR,N/A-D-V | SGK-NR | SGK-SR,WBF,20230084,Arbeitslosenversicherungsgesetz (AVIG). Teilre...,29252,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Schlussabstimmung | * | *,...,Ja,Ja,Ja,Ja,197,0,1,1,0,Hat nicht teilgenommen
368,14.06.2024,NR,N/A-D-V | WAK-NR | WAK-SR,EFD,20240024,Bundesgesetz über die Besteuerung der Telearbe...,29253,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Schlussabstimmung | * | *,...,Hat nicht teilgenommen,Ja,Ja,Ja,195,1,0,1,2,Hat nicht teilgenommen


### Format Headers 
The data is formatted such that the actual voting data only starts at the line 8, with line 7 containing the column titles about the proposals. On the right of the information about the voting proposals, there is one column per member of parliament, containing some general information about that member (name, parliamentary group, canton, date of birth, date of swearing in).

In a transaction system (OLTP), one would have (at least) two different tables, one for the votes and one for the members of parliament. However, for now, we just combine the pieces of information on the members of parliament into one string that we will use as header. This is done in the next cell (you don't have to understand the details of this):

In [209]:
header_rows = df_votings_raw.iloc[:8]
combined_headers = header_rows.apply(lambda x: x.ffill()).apply(lambda x: ' | '.join(x.dropna().astype(str)), axis=0)
print(combined_headers[:15])

0                                      Abstimmungsdatum
1                                                   Rat
2                                 Zuständige Kommission
3                                    Zuständige Behörde
4                                       Geschäftsnummer
5                                        Geschäftstitel
6                                        Referenznummer
7                                          Bedeutung Ja
8                                        Bedeutung Nein
9                                 Abstimmungsgegenstand
10                                         Vorlagetitel
11    Ratsmitglied (Nr) | Name des Ratsmitgliedes | ...
12    4154 | Addor, Jean-Luc | NR | V | VS | 22.04.1...
13    4049 | Aebischer, Matthias | NR | S | BE | 18....
14    10803 | Aellen, Cyril | NR | RL | GE | 29.02.1...
dtype: object


We now build a data frame `df_votings` from the actual data items (i.e., the rows 8 and onwards) and use these combined strings as the new headers:

In [210]:
# Combine headers into one (Ratsmitgliedes info)
df_votings = pd.DataFrame(df_votings_raw.values[8:], columns=combined_headers)
df_votings.head()

Unnamed: 0,Abstimmungsdatum,Rat,Zuständige Kommission,Zuständige Behörde,Geschäftsnummer,Geschäftstitel,Referenznummer,Bedeutung Ja,Bedeutung Nein,Abstimmungsgegenstand,...,"10851 | Zryd, Andrea | NR | S | BE | 24.10.1975 | 04.12.2023 | 04.12.2023","4179 | Zuberbühler, David | NR | V | AR | 20.02.1979 | 04.12.2023 | 04.12.2023","10822 | Zybach, Ursula | NR | S | BE | 29.08.1967 | 04.12.2023 | 04.12.2023",Entscheid des Rates,Anzahl 'Ja',Anzahl 'Nein',Anzahl Enthaltungen,Anzahl 'entschuldigt',Anzahl 'nicht teilgenommen',Teilnahme Präsident/in an der Abstimmung
0,29.05.2024,NR,FK-NR | FK-SR | N/A-D-V | WBK-NR | WBK-SR,WBF,20240031,"Förderung von Bildung, Forschung und Innovatio...",28659,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Gesamtabstimmung | * | *,...,Enthaltung,Ja,Ja,Ja,145,48,3,3,0,Hat nicht teilgenommen
1,27.05.2024,NR,N/A-D-V | SPK-NR | SPK-SR,EJPD,20210504,Bei häuslicher Gewalt die Härtefallpraxis nach...,28789,Antrag der Mehrheit | * | *,Antrag der Minderheit Steinemann (gemäss SR) |...,Art. 50 Abs. 2 Bst. a Ziff. 2 | * | *,...,Ja,Nein,Ja,Ja,126,62,0,3,8,Hat nicht teilgenommen
2,27.05.2024,NR,N/A-D-V | SPK-NR | SPK-SR,EJPD,20210504,Bei häuslicher Gewalt die Härtefallpraxis nach...,28790,Antrag der Mehrheit (gemäss SR und BR) | * | *,Antrag der Minderheit Schläfli (festhaten) | *...,Art. 50 Abs. 2bis | * | *,...,Nein,Ja,Nein,Ja,127,62,1,3,6,Hat nicht teilgenommen
3,27.05.2024,NR,N/A-D-V | RK-NR | RK-SR,EJPD,20230057,ZGB. Änderung (Massnahmen gegen Minderjährigen...,28792,Antrag der Mehrheit | * | *,Antrag der Minderheit von Falkenstein (gemäss ...,Art. 105a (gilt auch für Art. 9a Abs. 2 Partne...,...,Ja,Ja,Ja,Ja,122,65,1,3,8,Hat nicht teilgenommen
4,27.05.2024,NR,N/A-D-V | RK-NR | RK-SR,EJPD,20230057,ZGB. Änderung (Massnahmen gegen Minderjährigen...,28793,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Gesamtabstimmung | * | *,...,Ja,Ja,Ja,Ja,191,0,0,3,5,Hat nicht teilgenommen


In [211]:
df_votings.iloc[:5, :15]

Unnamed: 0,Abstimmungsdatum,Rat,Zuständige Kommission,Zuständige Behörde,Geschäftsnummer,Geschäftstitel,Referenznummer,Bedeutung Ja,Bedeutung Nein,Abstimmungsgegenstand,Vorlagetitel,Ratsmitglied (Nr) | Name des Ratsmitgliedes | Rat | Fraktion | Kanton | Geburtsdatum | Vereidigungsdatum | Vereidigungsdatum,"4154 | Addor, Jean-Luc | NR | V | VS | 22.04.1964 | 04.12.2023 | 04.12.2023","4049 | Aebischer, Matthias | NR | S | BE | 18.10.1967 | 04.12.2023 | 04.12.2023","10803 | Aellen, Cyril | NR | RL | GE | 29.02.1972 | 04.12.2023 | 04.12.2023"
0,29.05.2024,NR,FK-NR | FK-SR | N/A-D-V | WBK-NR | WBK-SR,WBF,20240031,"Förderung von Bildung, Forschung und Innovatio...",28659,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Gesamtabstimmung | * | *,Bundesgesetz über die Eidgenössischen Technisc...,,Ja,Ja,Ja
1,27.05.2024,NR,N/A-D-V | SPK-NR | SPK-SR,EJPD,20210504,Bei häuslicher Gewalt die Härtefallpraxis nach...,28789,Antrag der Mehrheit | * | *,Antrag der Minderheit Steinemann (gemäss SR) |...,Art. 50 Abs. 2 Bst. a Ziff. 2 | * | *,Bundesgesetz über die Ausländerinnen und Auslä...,,Nein,Ja,Ja
2,27.05.2024,NR,N/A-D-V | SPK-NR | SPK-SR,EJPD,20210504,Bei häuslicher Gewalt die Härtefallpraxis nach...,28790,Antrag der Mehrheit (gemäss SR und BR) | * | *,Antrag der Minderheit Schläfli (festhaten) | *...,Art. 50 Abs. 2bis | * | *,Bundesgesetz über die Ausländerinnen und Auslä...,,Ja,Nein,Ja
3,27.05.2024,NR,N/A-D-V | RK-NR | RK-SR,EJPD,20230057,ZGB. Änderung (Massnahmen gegen Minderjährigen...,28792,Antrag der Mehrheit | * | *,Antrag der Minderheit von Falkenstein (gemäss ...,Art. 105a (gilt auch für Art. 9a Abs. 2 Partne...,Schweizerisches Zivilgesetzbuch (Massnahmen ge...,,Ja,Ja,Nein
4,27.05.2024,NR,N/A-D-V | RK-NR | RK-SR,EJPD,20230057,ZGB. Änderung (Massnahmen gegen Minderjährigen...,28793,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Gesamtabstimmung | * | *,Schweizerisches Zivilgesetzbuch (Massnahmen ge...,,Ja,Ja,Ja


### Drop Redundant Columns

 We drop the redundant columns `Ratsmitglied (Nr) | Name des Ratsmitgliedes | Rat | Fraktion | Kanton | Geburtsdatum | Vereidigungsdatum | Vereidigungsdatum` and `Rat` (which is `NR` for all items, as we only look at the national council). 

In [212]:
# Remove unused columns
df_votings.drop(columns=['Ratsmitglied (Nr) | Name des Ratsmitgliedes | Rat | Fraktion | Kanton | Geburtsdatum | Vereidigungsdatum | Vereidigungsdatum',
                        'Rat', 'Teilnahme Präsident/in an der Abstimmung'], inplace = True)

### Translate Column Headers
Next, we translate the column headers to English

In [213]:
df_votings.rename(columns={
    'Abstimmungsdatum': 'Voting Date', 
    'Zuständige Kommission': 'Responsible Commission',
    'Zuständige Behörde': 'Responsible Authority',
    'Geschäftsnummer': 'Topic Number',
    'Geschäftstitel': 'Topic Title',
    'Referenznummer': 'Reference ID',
    'Bedeutung Ja': 'Meaning of Yes',
    'Bedeutung Nein': 'Meaning of No',
    'Abstimmungsgegenstand': 'Voting Subject',
    'Vorlagetitel': 'Proposal Title',
}, inplace=True)

df_votings.rename(columns={
    'Entscheid des Rates': 'Council Decision', 
    "Anzahl 'Ja'": 'Number of Yes',
    "Anzahl 'Nein'": 'Number of No',
    "Anzhal Enthaltungen": 'Number of Abstentions',
    "Anzahl 'entschuldigt'": 'Number of excused',
    "Anzahl 'nicht teilgenommen'": 'Number of non-participation',

}, inplace=True)

Furthermore, we set the "Referenznummer" of the proposal as index, as it is unique for each topic.

In [214]:
df_votings.set_index('Reference ID', inplace=True)
df_votings

Unnamed: 0_level_0,Voting Date,Responsible Commission,Responsible Authority,Topic Number,Topic Title,Meaning of Yes,Meaning of No,Voting Subject,Proposal Title,"4154 | Addor, Jean-Luc | NR | V | VS | 22.04.1964 | 04.12.2023 | 04.12.2023",...,"10846 | Wyssmann, Rémy | NR | V | SO | 20.06.1967 | 04.12.2023 | 04.12.2023","10851 | Zryd, Andrea | NR | S | BE | 24.10.1975 | 04.12.2023 | 04.12.2023","4179 | Zuberbühler, David | NR | V | AR | 20.02.1979 | 04.12.2023 | 04.12.2023","10822 | Zybach, Ursula | NR | S | BE | 29.08.1967 | 04.12.2023 | 04.12.2023",Council Decision,Number of Yes,Number of No,Anzahl Enthaltungen,Number of excused,Number of non-participation
Reference ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
28659,29.05.2024,FK-NR | FK-SR | N/A-D-V | WBK-NR | WBK-SR,WBF,20240031,"Förderung von Bildung, Forschung und Innovatio...",Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Gesamtabstimmung | * | *,Bundesgesetz über die Eidgenössischen Technisc...,Ja,...,Ja,Enthaltung,Ja,Ja,Ja,145,48,3,3,0
28789,27.05.2024,N/A-D-V | SPK-NR | SPK-SR,EJPD,20210504,Bei häuslicher Gewalt die Härtefallpraxis nach...,Antrag der Mehrheit | * | *,Antrag der Minderheit Steinemann (gemäss SR) |...,Art. 50 Abs. 2 Bst. a Ziff. 2 | * | *,Bundesgesetz über die Ausländerinnen und Auslä...,Nein,...,Nein,Ja,Nein,Ja,Ja,126,62,0,3,8
28790,27.05.2024,N/A-D-V | SPK-NR | SPK-SR,EJPD,20210504,Bei häuslicher Gewalt die Härtefallpraxis nach...,Antrag der Mehrheit (gemäss SR und BR) | * | *,Antrag der Minderheit Schläfli (festhaten) | *...,Art. 50 Abs. 2bis | * | *,Bundesgesetz über die Ausländerinnen und Auslä...,Ja,...,Ja,Nein,Ja,Nein,Ja,127,62,1,3,6
28792,27.05.2024,N/A-D-V | RK-NR | RK-SR,EJPD,20230057,ZGB. Änderung (Massnahmen gegen Minderjährigen...,Antrag der Mehrheit | * | *,Antrag der Minderheit von Falkenstein (gemäss ...,Art. 105a (gilt auch für Art. 9a Abs. 2 Partne...,Schweizerisches Zivilgesetzbuch (Massnahmen ge...,Ja,...,Ja,Ja,Ja,Ja,Ja,122,65,1,3,8
28793,27.05.2024,N/A-D-V | RK-NR | RK-SR,EJPD,20230057,ZGB. Änderung (Massnahmen gegen Minderjährigen...,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Gesamtabstimmung | * | *,Schweizerisches Zivilgesetzbuch (Massnahmen ge...,Ja,...,Ja,Ja,Ja,Ja,Ja,191,0,0,3,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29249,14.06.2024,N/A-D-V | WBK-NR | WBK-SR,EJPD,20230070,Austausch von Daten betreffend gesperrte Spiel...,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Schlussabstimmung | * | *,Bundesbeschluss über die Genehmigung des Abkom...,Ja,...,Ja,Ja,Ja,Ja,Ja,195,3,0,1,0
29250,14.06.2024,N/A-D-V | WAK-NR | WAK-SR,EFD,20230077,Abkommen zwischen der Schweiz und Slowenien zu...,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Schlussabstimmung | * | *,Bundesbeschluss über die Genehmigung eines Pro...,Nein,...,Nein,Ja,Nein,Ja,Ja,145,45,7,1,1
29251,14.06.2024,N/A-D-V | WAK-NR | WAK-SR,EFD,20230080,Zusatzabkommen zum Abkommen vom 9. September 1...,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Schlussabstimmung | * | *,Bundesbeschluss über die Genehmigung und die U...,Ja,...,Ja,Ja,Ja,Ja,Ja,196,1,1,1,0
29252,14.06.2024,N/A-D-V | SGK-NR | SGK-SR,WBF,20230084,Arbeitslosenversicherungsgesetz (AVIG). Teilre...,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Schlussabstimmung | * | *,Bundesgesetz über die obligatorische Arbeitslo...,Ja,...,Ja,Ja,Ja,Ja,Ja,197,0,1,1,0


**Exercise**: Adapt the above command to also translate the columns "Entscheid des Rates", "Anzahl 'Ja'", "Anzahl 'Nein'", "Anzahl Enthaltungen", "Anzahl 'entschuldigt'", "Anzahl 'nicht teilgenommen'" to "Council Decision", "Number of Yes", "Number of No", "Number of Abstentions", "Number of excused", "Number of non-participation".

## Overview of the Data
To verify the data is in good shape, we look at the different colums:

first 9 columns contain the proposal information:
- Voting Date
- Responsible Commission
- Responsible Authority
- Topic Number
- Topic Title
- Meaning of Yes
- Meaning of No
- Voting Proposal
- Voting Title

last 6 columns are summary of voting:
- Council Decision
- Number of Yes
- Number of No
- Number of Abstentions
- Number of excused
- Number of non-participation

The columns in between the votes by the members of parliament:
- 4049 | Aebischer, Matthias | NR | S | BE | 18.10.1967 | 04.12.2023 | 04.12.2023
- 10803 | Aellen, Cyril | NR | RL | GE | 29.02.1972 | 04.12.2023 | 04.12.2023
- \...
- 4179 | Zuberbühler, David | NR | V | AR | 20.02.1979 | 04.12.2023 | 04.12.2023
- 10822 | Zybach, Ursula | NR | S | BE | 29.08.1967 | 04.12.2023 | 04.12.2023

To verify, we briefly look at each of these groups of columns and print the column names:

In [215]:
case_info_col_count = 9
case_info_cols = list(df_votings.columns[:case_info_col_count])
print(case_info_cols)

['Voting Date', 'Responsible Commission', 'Responsible Authority', 'Topic Number', 'Topic Title', 'Meaning of Yes', 'Meaning of No', 'Voting Subject', 'Proposal Title']


In [216]:
summary_col_count = 6
summary_cols = list(df_votings.columns[-summary_col_count:])
print(summary_cols)

['Council Decision', 'Number of Yes', 'Number of No', 'Anzahl Enthaltungen', 'Number of excused', 'Number of non-participation']


As there are 200 members of parliament, one of them is the president. We only print the first and last few to check:

In [217]:
senators_cols = list(df_votings.columns[case_info_col_count:-summary_col_count])

In [218]:
len(senators_cols)

200

In [219]:
for mem in senators_cols[:3]:
    print(mem)

4154 | Addor, Jean-Luc | NR | V | VS | 22.04.1964 | 04.12.2023 | 04.12.2023
4049 | Aebischer, Matthias | NR | S | BE | 18.10.1967 | 04.12.2023 | 04.12.2023
10803 | Aellen, Cyril | NR | RL | GE | 29.02.1972 | 04.12.2023 | 04.12.2023


In [220]:
for mem in senators_cols[-3:]:
    print(mem)

10851 | Zryd, Andrea | NR | S | BE | 24.10.1975 | 04.12.2023 | 04.12.2023
4179 | Zuberbühler, David | NR | V | AR | 20.02.1979 | 04.12.2023 | 04.12.2023
10822 | Zybach, Ursula | NR | S | BE | 29.08.1967 | 04.12.2023 | 04.12.2023


To check the data content, we print the unique text values in some important columns and will replace text values with numbers later.

In [221]:
cols_to_transform_ja_nein = senators_cols + ['Council Decision']
np.unique(df_votings.loc[:,cols_to_transform_ja_nein].values)

array(['Die Präsidentin/der Präsident stimmt nicht', 'Enthaltung',
       'Entschuldigt gem. Art. 57 Abs. 4', 'Hat nicht teilgenommen', 'Ja',
       'Nein'], dtype=object)

In [222]:
df_votings['Responsible Commission'].unique()

array([' FK-NR | FK-SR | N/A-D-V | WBK-NR | WBK-SR',
       ' N/A-D-V | SPK-NR | SPK-SR', ' N/A-D-V | RK-NR | RK-SR',
       ' N/A-D-V | SGK-NR | SGK-SR', ' RK-NR | RK-SR',
       ' FK-NR | FK-SR | N/A-D-V',
       ' FK-NR | FK-SR | N/A-D-V | WAK-NR | WAK-SR', nan,
       ' APK-NR | APK-SR', 'Unknown', ' N/A-D-V | WBK-NR | WBK-SR',
       ' N/A-D-V | UREK-NR | UREK-SR', ' APK-NR | APK-SR | N/A-D-V',
       ' N/A-D-V | WBK-NR', ' N/A-D-V | WAK-NR | WAK-SR',
       ' KVF-NR | KVF-SR | N/A-D-V', ' FK-NR | FK-SR | N/A-D-V | SGK-NR',
       ' FK-NR | FK-SR | LPK-N | LPK-S | N/A-D-V',
       ' APK-NR | APK-SR | FK-NR | FK-SR | GPK-N | GPK-S | KVF-NR | KVF-SR | RK-NR | RK-SR | SGK-NR | SGK-SR | SiK-NR | SiK-SR | SPK-NR | SPK-SR | UREK-NR | UREK-SR | WAK-NR | WAK-SR | WBK-NR | WBK-SR',
       ' UREK-NR | UREK-SR', ' N/A-D-V | SGK-NR',
       ' FK-NR | FK-SR | N/A-D-V | SiK-NR | SiK-SR',
       ' N/A-D-V | SiK-NR | SiK-SR', ' N/A-D-V | SiK-NR',
       ' WAK-NR | WAK-SR', ' Bü-N | Bü-SR | N/A-D-

In [223]:
df_votings['Responsible Authority'].unique()

array(['WBF', 'EJPD', 'Parl', 'EDI', 'EFD', 'EDA', nan, 'BK', 'UVEK',
       'Unknown', 'VBS'], dtype=object)

Note that there are missing values in `Responsible Commission` and `Responsible Authority`. We fill them with `Unknown`, as we already have `Unknown` as another value. We first count how many records will be affected:

In [224]:
df_votings[['Responsible Commission', 'Responsible Authority']].isna().sum()

Responsible Commission    98
Responsible Authority      2
dtype: int64

In [225]:
df_votings['Responsible Authority'] = df_votings['Responsible Authority'].fillna('Unknown')
df_votings['Responsible Commission'] = df_votings['Responsible Commission'].fillna('Unknown')

In [226]:
df_votings

Unnamed: 0_level_0,Voting Date,Responsible Commission,Responsible Authority,Topic Number,Topic Title,Meaning of Yes,Meaning of No,Voting Subject,Proposal Title,"4154 | Addor, Jean-Luc | NR | V | VS | 22.04.1964 | 04.12.2023 | 04.12.2023",...,"10846 | Wyssmann, Rémy | NR | V | SO | 20.06.1967 | 04.12.2023 | 04.12.2023","10851 | Zryd, Andrea | NR | S | BE | 24.10.1975 | 04.12.2023 | 04.12.2023","4179 | Zuberbühler, David | NR | V | AR | 20.02.1979 | 04.12.2023 | 04.12.2023","10822 | Zybach, Ursula | NR | S | BE | 29.08.1967 | 04.12.2023 | 04.12.2023",Council Decision,Number of Yes,Number of No,Anzahl Enthaltungen,Number of excused,Number of non-participation
Reference ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
28659,29.05.2024,FK-NR | FK-SR | N/A-D-V | WBK-NR | WBK-SR,WBF,20240031,"Förderung von Bildung, Forschung und Innovatio...",Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Gesamtabstimmung | * | *,Bundesgesetz über die Eidgenössischen Technisc...,Ja,...,Ja,Enthaltung,Ja,Ja,Ja,145,48,3,3,0
28789,27.05.2024,N/A-D-V | SPK-NR | SPK-SR,EJPD,20210504,Bei häuslicher Gewalt die Härtefallpraxis nach...,Antrag der Mehrheit | * | *,Antrag der Minderheit Steinemann (gemäss SR) |...,Art. 50 Abs. 2 Bst. a Ziff. 2 | * | *,Bundesgesetz über die Ausländerinnen und Auslä...,Nein,...,Nein,Ja,Nein,Ja,Ja,126,62,0,3,8
28790,27.05.2024,N/A-D-V | SPK-NR | SPK-SR,EJPD,20210504,Bei häuslicher Gewalt die Härtefallpraxis nach...,Antrag der Mehrheit (gemäss SR und BR) | * | *,Antrag der Minderheit Schläfli (festhaten) | *...,Art. 50 Abs. 2bis | * | *,Bundesgesetz über die Ausländerinnen und Auslä...,Ja,...,Ja,Nein,Ja,Nein,Ja,127,62,1,3,6
28792,27.05.2024,N/A-D-V | RK-NR | RK-SR,EJPD,20230057,ZGB. Änderung (Massnahmen gegen Minderjährigen...,Antrag der Mehrheit | * | *,Antrag der Minderheit von Falkenstein (gemäss ...,Art. 105a (gilt auch für Art. 9a Abs. 2 Partne...,Schweizerisches Zivilgesetzbuch (Massnahmen ge...,Ja,...,Ja,Ja,Ja,Ja,Ja,122,65,1,3,8
28793,27.05.2024,N/A-D-V | RK-NR | RK-SR,EJPD,20230057,ZGB. Änderung (Massnahmen gegen Minderjährigen...,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Gesamtabstimmung | * | *,Schweizerisches Zivilgesetzbuch (Massnahmen ge...,Ja,...,Ja,Ja,Ja,Ja,Ja,191,0,0,3,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29249,14.06.2024,N/A-D-V | WBK-NR | WBK-SR,EJPD,20230070,Austausch von Daten betreffend gesperrte Spiel...,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Schlussabstimmung | * | *,Bundesbeschluss über die Genehmigung des Abkom...,Ja,...,Ja,Ja,Ja,Ja,Ja,195,3,0,1,0
29250,14.06.2024,N/A-D-V | WAK-NR | WAK-SR,EFD,20230077,Abkommen zwischen der Schweiz und Slowenien zu...,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Schlussabstimmung | * | *,Bundesbeschluss über die Genehmigung eines Pro...,Nein,...,Nein,Ja,Nein,Ja,Ja,145,45,7,1,1
29251,14.06.2024,N/A-D-V | WAK-NR | WAK-SR,EFD,20230080,Zusatzabkommen zum Abkommen vom 9. September 1...,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Schlussabstimmung | * | *,Bundesbeschluss über die Genehmigung und die U...,Ja,...,Ja,Ja,Ja,Ja,Ja,196,1,1,1,0
29252,14.06.2024,N/A-D-V | SGK-NR | SGK-SR,WBF,20230084,Arbeitslosenversicherungsgesetz (AVIG). Teilre...,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Schlussabstimmung | * | *,Bundesgesetz über die obligatorische Arbeitslo...,Ja,...,Ja,Ja,Ja,Ja,Ja,197,0,1,1,0


## Data Transformation
In order to process the data using different machine learning techniques, we need to process the actual values on our dataframe.

### Deriving new Attributes
One common step is to compute new attributes (or features). For example, we might want to know the percentage of members of parliament that voted 'Yes' to a given proposal:

**Exercise:** Write code to compute the percent of `yes` votes, and store this information as a new column called `Percent_Yes` in the dataframe `df_votings`.

In [227]:
df_votings['Percent_Yes'] = df_votings['Number of Yes'] /(df_votings['Number of Yes'] +df_votings['Number of No'])
df_votings

  df_votings['Percent_Yes'] = df_votings['Number of Yes'] /(df_votings['Number of Yes'] +df_votings['Number of No'])


Unnamed: 0_level_0,Voting Date,Responsible Commission,Responsible Authority,Topic Number,Topic Title,Meaning of Yes,Meaning of No,Voting Subject,Proposal Title,"4154 | Addor, Jean-Luc | NR | V | VS | 22.04.1964 | 04.12.2023 | 04.12.2023",...,"10851 | Zryd, Andrea | NR | S | BE | 24.10.1975 | 04.12.2023 | 04.12.2023","4179 | Zuberbühler, David | NR | V | AR | 20.02.1979 | 04.12.2023 | 04.12.2023","10822 | Zybach, Ursula | NR | S | BE | 29.08.1967 | 04.12.2023 | 04.12.2023",Council Decision,Number of Yes,Number of No,Anzahl Enthaltungen,Number of excused,Number of non-participation,Percent_Yes
Reference ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
28659,29.05.2024,FK-NR | FK-SR | N/A-D-V | WBK-NR | WBK-SR,WBF,20240031,"Förderung von Bildung, Forschung und Innovatio...",Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Gesamtabstimmung | * | *,Bundesgesetz über die Eidgenössischen Technisc...,Ja,...,Enthaltung,Ja,Ja,Ja,145,48,3,3,0,0.751295
28789,27.05.2024,N/A-D-V | SPK-NR | SPK-SR,EJPD,20210504,Bei häuslicher Gewalt die Härtefallpraxis nach...,Antrag der Mehrheit | * | *,Antrag der Minderheit Steinemann (gemäss SR) |...,Art. 50 Abs. 2 Bst. a Ziff. 2 | * | *,Bundesgesetz über die Ausländerinnen und Auslä...,Nein,...,Ja,Nein,Ja,Ja,126,62,0,3,8,0.670213
28790,27.05.2024,N/A-D-V | SPK-NR | SPK-SR,EJPD,20210504,Bei häuslicher Gewalt die Härtefallpraxis nach...,Antrag der Mehrheit (gemäss SR und BR) | * | *,Antrag der Minderheit Schläfli (festhaten) | *...,Art. 50 Abs. 2bis | * | *,Bundesgesetz über die Ausländerinnen und Auslä...,Ja,...,Nein,Ja,Nein,Ja,127,62,1,3,6,0.671958
28792,27.05.2024,N/A-D-V | RK-NR | RK-SR,EJPD,20230057,ZGB. Änderung (Massnahmen gegen Minderjährigen...,Antrag der Mehrheit | * | *,Antrag der Minderheit von Falkenstein (gemäss ...,Art. 105a (gilt auch für Art. 9a Abs. 2 Partne...,Schweizerisches Zivilgesetzbuch (Massnahmen ge...,Ja,...,Ja,Ja,Ja,Ja,122,65,1,3,8,0.652406
28793,27.05.2024,N/A-D-V | RK-NR | RK-SR,EJPD,20230057,ZGB. Änderung (Massnahmen gegen Minderjährigen...,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Gesamtabstimmung | * | *,Schweizerisches Zivilgesetzbuch (Massnahmen ge...,Ja,...,Ja,Ja,Ja,Ja,191,0,0,3,5,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29249,14.06.2024,N/A-D-V | WBK-NR | WBK-SR,EJPD,20230070,Austausch von Daten betreffend gesperrte Spiel...,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Schlussabstimmung | * | *,Bundesbeschluss über die Genehmigung des Abkom...,Ja,...,Ja,Ja,Ja,Ja,195,3,0,1,0,0.984848
29250,14.06.2024,N/A-D-V | WAK-NR | WAK-SR,EFD,20230077,Abkommen zwischen der Schweiz und Slowenien zu...,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Schlussabstimmung | * | *,Bundesbeschluss über die Genehmigung eines Pro...,Nein,...,Ja,Nein,Ja,Ja,145,45,7,1,1,0.763158
29251,14.06.2024,N/A-D-V | WAK-NR | WAK-SR,EFD,20230080,Zusatzabkommen zum Abkommen vom 9. September 1...,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Schlussabstimmung | * | *,Bundesbeschluss über die Genehmigung und die U...,Ja,...,Ja,Ja,Ja,Ja,196,1,1,1,0,0.994924
29252,14.06.2024,N/A-D-V | SGK-NR | SGK-SR,WBF,20230084,Arbeitslosenversicherungsgesetz (AVIG). Teilre...,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Schlussabstimmung | * | *,Bundesgesetz über die obligatorische Arbeitslo...,Ja,...,Ja,Ja,Ja,Ja,197,0,1,1,0,1.0


### Conversion to Numeric Values
In the following cell we convert text to numbers.

The possible values for votes are
- Ja
- Nein
- Enthaltung
- Hat nicht teilgenommen
- Entschuldigt gem. Art. 57 Abs. 4
- Die Präsidentin/der Präsident stimmt nicht

We convert "Ja" (yes) and "Nein" (no) to 1 and -1, respectively. We map all different reasons for non-participantion to 0.

In [228]:
df_nr_votings = df_votings.copy()
mapping_ja_nein = {'Ja': 1, 'Nein': -1, 'Enthaltung': 0, 'Hat nicht teilgenommen': 0, 
                   'Entschuldigt gem. Art. 57 Abs. 4': 0,
                   'Die Präsidentin/der Präsident stimmt nicht':0 }
df_nr_votings.loc[:, cols_to_transform_ja_nein] = \
            df_nr_votings.loc[:, cols_to_transform_ja_nein].map(mapping_ja_nein.get)

Next we convert `Responsible Commission` and `Responsible Authority` to one-hot encodings. We can use the function `get_dummies` from the `pandas` library to do so. We start with the `Responsible Commission`:

In [229]:
df_RC_OH_encoded = pd.get_dummies(df_nr_votings['Responsible Commission'],  dtype=int)

Let us look at how the first row was transformed:

In [230]:
df_nr_votings['Responsible Commission'][0]

  df_nr_votings['Responsible Commission'][0]


' FK-NR | FK-SR | N/A-D-V | WBK-NR | WBK-SR'

In [231]:
df_RC_OH_encoded.iloc[0]

 APK-NR | APK-SR                                                                                                                                                                                    0
 APK-NR | APK-SR | FK-NR | FK-SR | GPK-N | GPK-S | KVF-NR | KVF-SR | RK-NR | RK-SR | SGK-NR | SGK-SR | SiK-NR | SiK-SR | SPK-NR | SPK-SR | UREK-NR | UREK-SR | WAK-NR | WAK-SR | WBK-NR | WBK-SR    0
 APK-NR | APK-SR | FK-NR | FK-SR | N/A-D-V                                                                                                                                                          0
 APK-NR | APK-SR | N/A-D-V                                                                                                                                                                          0
 Bü-N | Bü-SR | N/A-D-V                                                                                                                                                                             0
 FK-NR | F

We see that we got many new columns, one for each unique value in the original column `df_nr_votings['Responsible Commission']`. In the first row, the resonsible commission was `' FK-NR | FK-SR | N/A-D-V | WBK-NR | WBK-SR'`. In the transformed dataframe, there is a 1 in the column corresponding to this commission, and all other columns are set to 0. This is why this type of representing categorical attributes is called *one-hot encoding*.

We do the same for the `'Responsible Authority'`.

**Exercise:** Convert `'Responsible Authority'` to one-hot encoding.

In [None]:
# df_RA_OH_encoded = ...
df_RA_OH_encoded= pd.get_dummies(df_nr_votings['Responsible Authority'],  dtype=int)


Next, we combine the original dataframe with the two newly created dataframes containing the one-hot encodings.

**Exercise:** Write code to combine the three dataframes into a single dataframe called `df_nr_votings`. Since we have replaced the two columns `'Responsible Commission'` and `'Responsible Authority'`, by one-hot encodings, we also want to drop the two original columns

In [248]:
df_RA_OH_encoded
df_RC_OH_encoded
df_votings
df_nr_votings = pd.concat([df_votings,df_RC_OH_encoded,df_RA_OH_encoded],axis=1,join="inner")
df_nr_votings
df_nr_votings.drop(columns=["Responsible Commission","Responsible Authority"],inplace=True)


In [249]:
df_nr_votings.head(15)

Unnamed: 0_level_0,Voting Date,Topic Number,Topic Title,Meaning of Yes,Meaning of No,Voting Subject,Proposal Title,"4154 | Addor, Jean-Luc | NR | V | VS | 22.04.1964 | 04.12.2023 | 04.12.2023","4049 | Aebischer, Matthias | NR | S | BE | 18.10.1967 | 04.12.2023 | 04.12.2023","10803 | Aellen, Cyril | NR | RL | GE | 29.02.1972 | 04.12.2023 | 04.12.2023",...,BK,EDA,EDI,EFD,EJPD,Parl,UVEK,Unknown,VBS,WBF
Reference ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
28659,29.05.2024,20240031,"Förderung von Bildung, Forschung und Innovatio...",Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Gesamtabstimmung | * | *,Bundesgesetz über die Eidgenössischen Technisc...,Ja,Ja,Ja,...,0,0,0,0,0,0,0,0,0,1
28789,27.05.2024,20210504,Bei häuslicher Gewalt die Härtefallpraxis nach...,Antrag der Mehrheit | * | *,Antrag der Minderheit Steinemann (gemäss SR) |...,Art. 50 Abs. 2 Bst. a Ziff. 2 | * | *,Bundesgesetz über die Ausländerinnen und Auslä...,Nein,Ja,Ja,...,0,0,0,0,1,0,0,0,0,0
28790,27.05.2024,20210504,Bei häuslicher Gewalt die Härtefallpraxis nach...,Antrag der Mehrheit (gemäss SR und BR) | * | *,Antrag der Minderheit Schläfli (festhaten) | *...,Art. 50 Abs. 2bis | * | *,Bundesgesetz über die Ausländerinnen und Auslä...,Ja,Nein,Ja,...,0,0,0,0,1,0,0,0,0,0
28792,27.05.2024,20230057,ZGB. Änderung (Massnahmen gegen Minderjährigen...,Antrag der Mehrheit | * | *,Antrag der Minderheit von Falkenstein (gemäss ...,Art. 105a (gilt auch für Art. 9a Abs. 2 Partne...,Schweizerisches Zivilgesetzbuch (Massnahmen ge...,Ja,Ja,Nein,...,0,0,0,0,1,0,0,0,0,0
28793,27.05.2024,20230057,ZGB. Änderung (Massnahmen gegen Minderjährigen...,Annahme der Vorlage | * | *,Ablehnung der Vorlage | * | *,Gesamtabstimmung | * | *,Schweizerisches Zivilgesetzbuch (Massnahmen ge...,Ja,Ja,Ja,...,0,0,0,0,1,0,0,0,0,0
28794,27.05.2024,20233838,Migrationspartnerschaften. Eine strategische N...,Antrag der Mehrheit (Annahme der geänderten Mo...,Antrag der Minderheit Klopfenstein Broggini (A...,* | * | *,*,Ja,Nein,Ja,...,0,0,0,0,1,0,0,0,0,0
28795,27.05.2024,20234241,Korrektur der Praxisänderung in Bezug auf Asyl...,Antrag der Mehrheit und des Bundesrates (Ableh...,Antrag der Minderheit Schilliger (Annahme der ...,* | * | *,*,Nein,Ja,Nein,...,0,0,0,0,1,0,0,0,0,0
28796,27.05.2024,20243008,Schutz von Afghaninnen. Einzelfallprüfung und ...,Antrag der Kommission (Annahme der Bst. a) | *...,Antrag des Bundesrates (Ablehnung der Bst. a) ...,Buchstabe a | * | *,*,Ja,Nein,Ja,...,0,0,0,0,1,0,0,0,0,0
28797,27.05.2024,20243008,Schutz von Afghaninnen. Einzelfallprüfung und ...,Antrag der Kommission (Annahme der Bst. c) | *...,Antrag des Bundesrates Ablehnung der Bst. c) |...,Buchstabe c | * | *,*,Ja,Nein,Ja,...,0,0,0,0,1,0,0,0,0,0
28798,27.05.2024,20210511,"Gleichstellung von Witwen und Witwern, sobald ...",Antrag der Mehrheit (Folge geben) | * | *,Antrag der Minderheit Gutjahr (keine Folge geg...,* | * | *,*,Nein,Ja,Nein,...,0,0,0,0,0,1,0,0,0,0


### Handling Missing Values
Finally, we have to check whether there are any missing values. 

Missing values are typically indicated as `NA`. However, some missing values might also be more difficult to find, as they might be encoded in a different way. In our dataframe, there are some records where the `Topic Number` is set to `'Unknown'`.

**Exercise:** Implement the following steps to get rid of missing values:
* Count the total number of `na` values. Remember that you can use the functions `isna()` in any dataframe to find out, for every cell of the dataframe, whether the corresponding value is `NA` or not. Furthermore, you can use the function `df.sum()` to sum over the first dimension of a dataframe `df`.
* Delete all records (rows) which contain a `NA` value.
* Identify (e.g., print out) the records that have an `Unknown` `Topic Number`. Decide on what to do with these.

In [254]:
print(df_nr_votings.isna().sum())
df_nr_votings.dropna()
selected_rows=df_nr_votings.loc[df_nr_votings["Topic Number"]=="Unknown"]
print(selected_rows)

Voting Date       0
Topic Number      0
Topic Title       0
Meaning of Yes    0
Meaning of No     0
                 ..
Parl              0
UVEK              0
Unknown           0
VBS               0
WBF               0
Length: 250, dtype: int64
             Voting Date Topic Number Topic Title  \
Reference ID                                        
28955         29.05.2024      Unknown     Unknown   
28957         14.06.2024      Unknown     Unknown   

                                     Meaning of Yes  \
Reference ID                                          
28955         Zustimmung zum Ordnungsantrag | * | *   
28957         Zustimmung zum Ordnungsantrag | * | *   

                                       Meaning of No  \
Reference ID                                           
28955         Ablehnung des Ordnungsantrages | * | *   
28957         Ablehnung des Ordnungsantrages | * | *   

                                                 Voting Subject  \
Reference ID                

### Vectorizing Text Data
Some of the columns contain text data. We will vectorize these texts using count vectorizers to get a representation that we will use afterwards.

The following block of code runs a joint `CountVectorizer` for the `text_columns`. Using a `for` loop, it then iterates over the columns and does a vectorization. 
* In the first line of the `for` loop, the count vectorizer is used to transform the data. This yields a matrix (`transformed_data`) where each row corresponds to a voting proposal, and each column corresponds to the number of times the respective word occurrs in that text column.
* In the second column, a `DataFrame` is generated from the matrix `transformed_data`, with the column name representing first the name of the column of the original data frame, and then, separated by a `_`, the word that is counted in the respective column of the output dataframe.
* Finally, in the third line of the `for` loop, the new data frame is combined with the previously generated text representations in `df_nr_vectorized_text_info`.

In [255]:
from sklearn.feature_extraction.text import CountVectorizer

In [260]:
# Identify columns that contain text
text_columns = ['Topic Title', 'Meaning of Yes', 'Meaning of No', 'Voting Subject', 'Proposal Title']

df_nr_vectorized_text_info = pd.DataFrame({})

# Transform each text column using CountVectorizer
for col in text_columns:
    # Initialize CountVectorizer
    vectorizer = CountVectorizer(max_features=50)
    # adapt the vectorizer to the dataset (i.e., the column we are currently considering
    transformed_data = vectorizer.fit_transform(df_nr_votings[col])
    # convert the data to an understandable dataframe
    transformed_df = pd.DataFrame(transformed_data.toarray(), columns=[f"{col}_{word}" for word in vectorizer.get_feature_names_out()], 
                                 index=df_nr_votings.index)
    df_nr_vectorized_text_info = pd.concat([df_nr_vectorized_text_info, transformed_df], axis = 1)
    print(transformed_df)

              Topic Title_2023  Topic Title_2024  Topic Title_2025  \
Reference ID                                                         
28659                        0                 0                 1   
28789                        0                 0                 0   
28790                        0                 0                 0   
28792                        0                 0                 0   
28793                        0                 0                 0   
...                        ...               ...               ...   
29249                        0                 0                 0   
29250                        0                 0                 0   
29251                        0                 0                 0   
29252                        0                 0                 0   
29253                        0                 0                 0   

              Topic Title_2027  Topic Title_2028  Topic Title_abkommen  \
Reference ID   

The `df_nr_vectorized_text_info` now contains all the proposals as rows, and all the information of the text columns as columns.

**Exercise:** While the technical aspect of the above code cell is nontrivial (but also not needed for the rest of the project), it's important you understand the result of this. Take a moment to understand the representation we have generated. For example, look at `df_nr_vectorized_text_info` or the intermediate results in `transformed_data` and `transformed_df`. 

**Optional Exercise**: In the above cell, we have not used any stop-words. When looking at `df_nr_vectorized_text_info`, you will find both very common words and numbers being counted (and the counts being represented in the columns of `df_nr_vectorized_text_info`). 

Processing text data is often tricky and can be tedious, but also can have a significant impact on the performance. Play around with the preprocessing, e.g. with the following modifications (and combinations thereof):
* `CountVectorizer()` takes an optional argument `stop_words` and a list of words to be ignored (typically, because they are considered very common and thus uninformative). For example, you can write `vectorizer = CountVectorizer(stop_words = ['der', 'die', 'das'])`. Add your own stopwords, or find a list of common words in German (unfortunately, `scikit-learn` has a predefined list of stopwords only for English).
* You can replace any number by `NUM` by replacing `df_nr_votings[col]` (in the first line of the `for` loop) by `df_nr_votings[col].replace('\d+', 'NUM', regex=True)`; the vectorizer will then only see `NUM` instead of any number. Alternatively, you can use `df_nr_votings[col].replace('\d+', '', regex=True)` to remove any number in the texts.

Finally, we make make a copy of the original `df_nr_votings` dataframe where we remove the text columns. We save this as `df_nr_votings_numeric`:

In [261]:
df_nr_votings_numeric = df_nr_votings.copy().drop(columns=text_columns)

Let's look at the data frame `df_nr_votings_numeric`:

In [262]:
df_nr_votings_numeric

Unnamed: 0_level_0,Voting Date,Topic Number,"4154 | Addor, Jean-Luc | NR | V | VS | 22.04.1964 | 04.12.2023 | 04.12.2023","4049 | Aebischer, Matthias | NR | S | BE | 18.10.1967 | 04.12.2023 | 04.12.2023","10803 | Aellen, Cyril | NR | RL | GE | 29.02.1972 | 04.12.2023 | 04.12.2023","4053 | Aeschi, Thomas | NR | V | ZG | 13.01.1979 | 04.12.2023 | 04.12.2023","10812 | Alijaj, Islam | NR | S | ZH | 18.06.1986 | 04.12.2023 | 04.12.2023","4090 | Amaudruz, Céline | NR | V | GE | 15.03.1979 | 04.12.2023 | 04.12.2023","4320 | Amoos, Emmanuel | NR | S | VS | 31.07.1980 | 04.12.2023 | 04.12.2023","4245 | Andrey, Gerhard | NR | G | FR | 21.01.1976 | 04.12.2023 | 04.12.2023",...,BK,EDA,EDI,EFD,EJPD,Parl,UVEK,Unknown,VBS,WBF
Reference ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
28659,29.05.2024,20240031,Ja,Ja,Ja,Ja,Nein,Ja,Nein,Nein,...,0,0,0,0,0,0,0,0,0,1
28789,27.05.2024,20210504,Nein,Ja,Ja,Nein,Ja,Nein,Ja,Ja,...,0,0,0,0,1,0,0,0,0,0
28790,27.05.2024,20210504,Ja,Nein,Ja,Ja,Nein,Ja,Nein,Nein,...,0,0,0,0,1,0,0,0,0,0
28792,27.05.2024,20230057,Ja,Ja,Nein,Ja,Ja,Ja,Ja,Ja,...,0,0,0,0,1,0,0,0,0,0
28793,27.05.2024,20230057,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29249,14.06.2024,20230070,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,...,0,0,0,0,1,0,0,0,0,0
29250,14.06.2024,20230077,Nein,Ja,Ja,Nein,Ja,Ja,Ja,Ja,...,0,0,0,1,0,0,0,0,0,0
29251,14.06.2024,20230080,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,...,0,0,0,1,0,0,0,0,0,0
29252,14.06.2024,20230084,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,...,0,0,0,0,0,0,0,0,0,1


It contains two columns with general information, then the cast votes of every member of parliament (one column for each of the 200 members), and then some more information about the proposal. To avoid confusion, we separate the cast votes from the information about the subject:

In [263]:
df_nr_cast_votes = df_nr_votings_numeric.copy().iloc[:, 2:202]
df_nr_numeric_info = df_nr_votings_numeric.drop(columns=df_nr_cast_votes.columns)

`df_nr_numeric_info` still contains a column `Voting Date`, which is not numeric, and which we will not use in our analysis. We therefore drop it as well:

In [264]:
df_nr_numeric_info.drop(columns='Voting Date', inplace=True)

## Data Overview
To summarize, we now have the following three dataframes:
* `df_nr_numeric_info`: Numeric information about the voting proposals, e.g., the responsible administration.
* `df_nr_vectorized_text_info`: vectorized text information about the voting proposals.
* `df_nr_cast_votes`: the votes by each member of parliament (each column corresponds to a member of parliament).
In all 3 dataframes, each line is one voting proposal.

In [265]:
df_nr_numeric_info

Unnamed: 0_level_0,Topic Number,Council Decision,Number of Yes,Number of No,Anzahl Enthaltungen,Number of excused,Number of non-participation,Percent_Yes,APK-NR | APK-SR,APK-NR | APK-SR | FK-NR | FK-SR | GPK-N | GPK-S | KVF-NR | KVF-SR | RK-NR | RK-SR | SGK-NR | SGK-SR | SiK-NR | SiK-SR | SPK-NR | SPK-SR | UREK-NR | UREK-SR | WAK-NR | WAK-SR | WBK-NR | WBK-SR,...,BK,EDA,EDI,EFD,EJPD,Parl,UVEK,Unknown,VBS,WBF
Reference ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
28659,20240031,Ja,145,48,3,3,0,0.751295,0,0,...,0,0,0,0,0,0,0,0,0,1
28789,20210504,Ja,126,62,0,3,8,0.670213,0,0,...,0,0,0,0,1,0,0,0,0,0
28790,20210504,Ja,127,62,1,3,6,0.671958,0,0,...,0,0,0,0,1,0,0,0,0,0
28792,20230057,Ja,122,65,1,3,8,0.652406,0,0,...,0,0,0,0,1,0,0,0,0,0
28793,20230057,Ja,191,0,0,3,5,1.0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29249,20230070,Ja,195,3,0,1,0,0.984848,0,0,...,0,0,0,0,1,0,0,0,0,0
29250,20230077,Ja,145,45,7,1,1,0.763158,0,0,...,0,0,0,1,0,0,0,0,0,0
29251,20230080,Ja,196,1,1,1,0,0.994924,0,0,...,0,0,0,1,0,0,0,0,0,0
29252,20230084,Ja,197,0,1,1,0,1.0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [266]:
df_nr_vectorized_text_info

Unnamed: 0_level_0,Topic Title_2023,Topic Title_2024,Topic Title_2025,Topic Title_2027,Topic Title_2028,Topic Title_abkommen,Topic Title_als,Topic Title_an,Topic Title_arbeitslosigkeit,Topic Title_auch,...,Proposal Title_obligatorische,Proposal Title_schweizerisches,Proposal Title_stipendien,Proposal Title_studierende,Proposal Title_und,Proposal Title_von,Proposal Title_zivilgesetzbuch,Proposal Title_zusammenarbeit,Proposal Title_änderung,Proposal Title_über
Reference ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
28659,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
28789,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,3,0,0,0,0,2
28790,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,3,0,0,0,0,2
28792,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,0,0
28793,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29249,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,2,1,0,0,0,2
29250,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,1,1
29251,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,2,0,0,0,0,1
29252,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0,0,1


In [267]:
df_nr_cast_votes

Unnamed: 0_level_0,"4154 | Addor, Jean-Luc | NR | V | VS | 22.04.1964 | 04.12.2023 | 04.12.2023","4049 | Aebischer, Matthias | NR | S | BE | 18.10.1967 | 04.12.2023 | 04.12.2023","10803 | Aellen, Cyril | NR | RL | GE | 29.02.1972 | 04.12.2023 | 04.12.2023","4053 | Aeschi, Thomas | NR | V | ZG | 13.01.1979 | 04.12.2023 | 04.12.2023","10812 | Alijaj, Islam | NR | S | ZH | 18.06.1986 | 04.12.2023 | 04.12.2023","4090 | Amaudruz, Céline | NR | V | GE | 15.03.1979 | 04.12.2023 | 04.12.2023","4320 | Amoos, Emmanuel | NR | S | VS | 31.07.1980 | 04.12.2023 | 04.12.2023","4245 | Andrey, Gerhard | NR | G | FR | 21.01.1976 | 04.12.2023 | 04.12.2023","4184 | Arslan, Sibel | NR | G | BS | 23.06.1980 | 04.12.2023 | 04.12.2023","4246 | Badertscher, Christine | NR | G | BE | 11.01.1982 | 04.12.2023 | 04.12.2023",...,"4298 | Weichelt, Manuela | NR | G | ZG | 21.07.1967 | 04.12.2023 | 04.12.2023","4057 | Wermuth, Cédric | NR | S | AG | 19.02.1986 | 04.12.2023 | 04.12.2023","4299 | Wettstein, Felix | NR | G | SO | 19.01.1958 | 04.12.2023 | 04.12.2023","4300 | Widmer, Céline | NR | S | ZH | 26.05.1978 | 04.12.2023 | 04.12.2023","4305 | Wismer-Felder, Priska | NR | M-E | LU | 02.10.1970 | 04.12.2023 | 04.12.2023","4318 | Wyss, Sarah | NR | S | BS | 03.08.1988 | 04.12.2023 | 04.12.2023","10846 | Wyssmann, Rémy | NR | V | SO | 20.06.1967 | 04.12.2023 | 04.12.2023","10851 | Zryd, Andrea | NR | S | BE | 24.10.1975 | 04.12.2023 | 04.12.2023","4179 | Zuberbühler, David | NR | V | AR | 20.02.1979 | 04.12.2023 | 04.12.2023","10822 | Zybach, Ursula | NR | S | BE | 29.08.1967 | 04.12.2023 | 04.12.2023"
Reference ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
28659,Ja,Ja,Ja,Ja,Nein,Ja,Nein,Nein,Nein,Nein,...,Nein,Nein,Nein,Nein,Ja,Nein,Ja,Enthaltung,Ja,Ja
28789,Nein,Ja,Ja,Nein,Ja,Nein,Ja,Ja,Ja,Ja,...,Ja,Ja,Ja,Ja,Ja,Ja,Nein,Ja,Nein,Ja
28790,Ja,Nein,Ja,Ja,Nein,Ja,Nein,Nein,Nein,Nein,...,Nein,Nein,Nein,Nein,Ja,Nein,Ja,Nein,Ja,Nein
28792,Ja,Ja,Nein,Ja,Ja,Ja,Ja,Ja,Ja,Ja,...,Ja,Ja,Ja,Ja,Nein,Ja,Ja,Ja,Ja,Ja
28793,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,...,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29249,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,...,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja
29250,Nein,Ja,Ja,Nein,Ja,Ja,Ja,Ja,Ja,Ja,...,Ja,Ja,Ja,Ja,Ja,Ja,Nein,Ja,Nein,Ja
29251,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,...,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja
29252,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,...,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja,Ja


Finally, as we will train some models using all available data (i.e., the numerical and the text data), we combine `df_nr_numeric_info` and `df_nr_vectorized_text_info` to `df_nr_all_info`:

In [268]:
df_nr_all_info = pd.concat([df_nr_numeric_info, df_nr_vectorized_text_info], axis=1)
df_nr_all_info.head()

Unnamed: 0_level_0,Topic Number,Council Decision,Number of Yes,Number of No,Anzahl Enthaltungen,Number of excused,Number of non-participation,Percent_Yes,APK-NR | APK-SR,APK-NR | APK-SR | FK-NR | FK-SR | GPK-N | GPK-S | KVF-NR | KVF-SR | RK-NR | RK-SR | SGK-NR | SGK-SR | SiK-NR | SiK-SR | SPK-NR | SPK-SR | UREK-NR | UREK-SR | WAK-NR | WAK-SR | WBK-NR | WBK-SR,...,Proposal Title_obligatorische,Proposal Title_schweizerisches,Proposal Title_stipendien,Proposal Title_studierende,Proposal Title_und,Proposal Title_von,Proposal Title_zivilgesetzbuch,Proposal Title_zusammenarbeit,Proposal Title_änderung,Proposal Title_über
Reference ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
28659,20240031,Ja,145,48,3,3,0,0.751295,0,0,...,0,0,0,0,0,0,0,0,0,1
28789,20210504,Ja,126,62,0,3,8,0.670213,0,0,...,0,0,0,0,3,0,0,0,0,2
28790,20210504,Ja,127,62,1,3,6,0.671958,0,0,...,0,0,0,0,3,0,0,0,0,2
28792,20230057,Ja,122,65,1,3,8,0.652406,0,0,...,0,1,0,0,0,0,1,0,0,0
28793,20230057,Ja,191,0,0,3,5,1.0,0,0,...,0,1,0,0,0,0,1,0,0,0


In [269]:
df_nr_all_info.describe()

Unnamed: 0,APK-NR | APK-SR,APK-NR | APK-SR | FK-NR | FK-SR | GPK-N | GPK-S | KVF-NR | KVF-SR | RK-NR | RK-SR | SGK-NR | SGK-SR | SiK-NR | SiK-SR | SPK-NR | SPK-SR | UREK-NR | UREK-SR | WAK-NR | WAK-SR | WBK-NR | WBK-SR,APK-NR | APK-SR | FK-NR | FK-SR | N/A-D-V,APK-NR | APK-SR | N/A-D-V,Bü-N | Bü-SR | N/A-D-V,FK-NR | FK-SR | LPK-N | LPK-S | N/A-D-V,FK-NR | FK-SR | N/A-D-V,FK-NR | FK-SR | N/A-D-V | SGK-NR,FK-NR | FK-SR | N/A-D-V | SiK-NR | SiK-SR,FK-NR | FK-SR | N/A-D-V | WAK-NR | WAK-SR,...,Proposal Title_obligatorische,Proposal Title_schweizerisches,Proposal Title_stipendien,Proposal Title_studierende,Proposal Title_und,Proposal Title_von,Proposal Title_zivilgesetzbuch,Proposal Title_zusammenarbeit,Proposal Title_änderung,Proposal Title_über
count,361.0,361.0,361.0,361.0,361.0,361.0,361.0,361.0,361.0,361.0,...,361.0,361.0,361.0,361.0,361.0,361.0,361.0,361.0,361.0,361.0
mean,0.00277,0.00831,0.00277,0.00831,0.00277,0.030471,0.041551,0.01108,0.00831,0.00277,...,0.022161,0.036011,0.033241,0.033241,0.315789,0.036011,0.030471,0.063712,0.022161,0.515235
std,0.052632,0.090907,0.052632,0.090907,0.052632,0.172118,0.199838,0.104824,0.090907,0.052632,...,0.14741,0.186576,0.179514,0.179514,0.74162,0.186576,0.172118,0.244578,0.14741,0.568053
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,4.0,1.0,1.0,1.0,1.0,3.0


## Save the datasets

Finally we are ready to save the dataframes into files that we will load when running  other analysis steps.

In [270]:
df_nr_numeric_info.to_csv(file_path.replace('.xlsx', '_numeric_info.csv'))
df_nr_vectorized_text_info.to_csv(file_path.replace('.xlsx', '_vectorized_text_info.csv'))
df_nr_cast_votes.to_csv(file_path.replace('.xlsx', '_cast_votes.csv'))
df_nr_all_info.to_csv(file_path.replace('.xlsx', '_all_info.csv'))

Furthermore, for later reference, we save a brief summary for every voting proposal:

In [271]:
info_cols = [ x for x in (case_info_cols + summary_cols + ['Percent_Yes']) if not x in ['Responsible Commission', 'Responsible Authority'] ]

In [272]:
df_nr_votings[ info_cols ].to_csv(file_path.replace('.xlsx', '_summary.csv'))

**Exercise:**
* Given the dataset that we have now cleaned, what applications could you imagine? How could they be useful?
* What other data might be useful? What additional analysis could you do with additional data?

Discuss your ideas with another participant or somebody from the teaching team.

In [None]:
MPs can be categorized based on the vectorized text_columns. Topics and reasonings of votes can be compared to highlight 
the sensitivities of MPs to proposals. 
With historical data, MPs priorities/choices can be studied through time.