# Project 5: Data Science & Machine Learning on Votings of the Swiss National Council

In project 5, we will analyze the voting behavior of the "Nationalrat" of the Swiss parliament in a number of ways. The project consists of 3 different files:

* Data Preparation: Prepare the data for the other two notebooks.
* Voting Predictions (this notebook): Predict the voting behavior of individual members or the entire council.
* Unsupervised: Find lower-dimensional representations of the voting behavior and groups of members of parliament.

**Make sure to have run the data preparation notebook before running this one!**

# Preparations
We start with the usual preparations.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import re # regular expressions
from datetime import datetime # to calculate the age

In [3]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

In [4]:
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.neural_network import MLPRegressor

## Data Loading

First, we load the processed data sets. To see the detailed data preprocessing, please refer to Project 5 - Data Preparation.

In [5]:
file_path_ss24_nr_root = 'Abstimmungen_NR_2024SS_DE'

df_nr_numeric_info = pd.read_csv(file_path_ss24_nr_root + '_numeric_info.csv', index_col='Reference ID')
df_nr_vectorized_text_info = pd.read_csv(file_path_ss24_nr_root + '_vectorized_text_info.csv', index_col='Reference ID')
df_nr_all_info = pd.read_csv(file_path_ss24_nr_root + '_all_info.csv', index_col='Reference ID')
df_nr_cast_votes = pd.read_csv(file_path_ss24_nr_root + '_cast_votes.csv', index_col='Reference ID')

In [6]:
df_nr_numeric_info

Reference ID,Topic Number,Council Decision,Number of Yes,Number of No,Anzahl Enthaltungen,Number of excused,Number of non-participation,Percent_Yes,APK-NR | APK-SR,APK-NR | APK-SR | FK-NR | FK-SR | GPK-N | GPK-S | KVF-NR | KVF-SR | RK-NR | RK-SR | SGK-NR | SGK-SR | SiK-NR | SiK-SR | SPK-NR | SPK-SR | UREK-NR | UREK-SR | WAK-NR | WAK-SR | WBK-NR | WBK-SR,...,BK,EDA,EDI,EFD,EJPD,Parl,UVEK,Unknown.1,VBS,WBF
28659,20240031,1,145,48,3,3,0,0.751295,0,0,...,0,0,0,0,0,0,0,0,0,1
28789,20210504,1,126,62,0,3,8,0.670213,0,0,...,0,0,0,0,1,0,0,0,0,0
28790,20210504,1,127,62,1,3,6,0.671958,0,0,...,0,0,0,0,1,0,0,0,0,0
28792,20230057,1,122,65,1,3,8,0.652406,0,0,...,0,0,0,0,1,0,0,0,0,0
28793,20230057,1,191,0,0,3,5,1.000000,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29249,20230070,1,195,3,0,1,0,0.984848,0,0,...,0,0,0,0,1,0,0,0,0,0
29250,20230077,1,145,45,7,1,1,0.763158,0,0,...,0,0,0,1,0,0,0,0,0,0
29251,20230080,1,196,1,1,1,0,0.994924,0,0,...,0,0,0,1,0,0,0,0,0,0
29252,20230084,1,197,0,1,1,0,1.000000,0,0,...,0,0,0,0,0,0,0,0,0,1


In [7]:
df_nr_all_info

Reference ID,Topic Number,Council Decision,Number of Yes,Number of No,Anzahl Enthaltungen,Number of excused,Number of non-participation,Percent_Yes,APK-NR | APK-SR,APK-NR | APK-SR | FK-NR | FK-SR | GPK-N | GPK-S | KVF-NR | KVF-SR | RK-NR | RK-SR | SGK-NR | SGK-SR | SiK-NR | SiK-SR | SPK-NR | SPK-SR | UREK-NR | UREK-SR | WAK-NR | WAK-SR | WBK-NR | WBK-SR,...,Proposal Title_obligatorische,Proposal Title_schweizerisches,Proposal Title_stipendien,Proposal Title_studierende,Proposal Title_und,Proposal Title_von,Proposal Title_zivilgesetzbuch,Proposal Title_zusammenarbeit,Proposal Title_änderung,Proposal Title_über
28659,20240031,1,145,48,3,3,0,0.751295,0,0,...,0,0,0,0,0,0,0,0,0,1
28789,20210504,1,126,62,0,3,8,0.670213,0,0,...,0,0,0,0,3,0,0,0,0,2
28790,20210504,1,127,62,1,3,6,0.671958,0,0,...,0,0,0,0,3,0,0,0,0,2
28792,20230057,1,122,65,1,3,8,0.652406,0,0,...,0,1,0,0,0,0,1,0,0,0
28793,20230057,1,191,0,0,3,5,1.000000,0,0,...,0,1,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29249,20230070,1,195,3,0,1,0,0.984848,0,0,...,0,0,0,0,2,1,0,0,0,2
29250,20230077,1,145,45,7,1,1,0.763158,0,0,...,0,0,0,0,1,0,0,0,1,1
29251,20230080,1,196,1,1,1,0,0.994924,0,0,...,0,0,0,0,2,0,0,0,0,1
29252,20230084,1,197,0,1,1,0,1.000000,0,0,...,1,0,0,0,1,0,0,0,0,1


In [8]:
df_nr_cast_votes

Reference ID,"4154 | Addor, Jean-Luc | NR | V | VS | 22.04.1964 | 04.12.2023 | 04.12.2023","4049 | Aebischer, Matthias | NR | S | BE | 18.10.1967 | 04.12.2023 | 04.12.2023","10803 | Aellen, Cyril | NR | RL | GE | 29.02.1972 | 04.12.2023 | 04.12.2023","4053 | Aeschi, Thomas | NR | V | ZG | 13.01.1979 | 04.12.2023 | 04.12.2023","10812 | Alijaj, Islam | NR | S | ZH | 18.06.1986 | 04.12.2023 | 04.12.2023","4090 | Amaudruz, Céline | NR | V | GE | 15.03.1979 | 04.12.2023 | 04.12.2023","4320 | Amoos, Emmanuel | NR | S | VS | 31.07.1980 | 04.12.2023 | 04.12.2023","4245 | Andrey, Gerhard | NR | G | FR | 21.01.1976 | 04.12.2023 | 04.12.2023","4184 | Arslan, Sibel | NR | G | BS | 23.06.1980 | 04.12.2023 | 04.12.2023","4246 | Badertscher, Christine | NR | G | BE | 11.01.1982 | 04.12.2023 | 04.12.2023",...,"4298 | Weichelt, Manuela | NR | G | ZG | 21.07.1967 | 04.12.2023 | 04.12.2023","4057 | Wermuth, Cédric | NR | S | AG | 19.02.1986 | 04.12.2023 | 04.12.2023","4299 | Wettstein, Felix | NR | G | SO | 19.01.1958 | 04.12.2023 | 04.12.2023","4300 | Widmer, Céline | NR | S | ZH | 26.05.1978 | 04.12.2023 | 04.12.2023","4305 | Wismer-Felder, Priska | NR | M-E | LU | 02.10.1970 | 04.12.2023 | 04.12.2023","4318 | Wyss, Sarah | NR | S | BS | 03.08.1988 | 04.12.2023 | 04.12.2023","10846 | Wyssmann, Rémy | NR | V | SO | 20.06.1967 | 04.12.2023 | 04.12.2023","10851 | Zryd, Andrea | NR | S | BE | 24.10.1975 | 04.12.2023 | 04.12.2023","4179 | Zuberbühler, David | NR | V | AR | 20.02.1979 | 04.12.2023 | 04.12.2023","10822 | Zybach, Ursula | NR | S | BE | 29.08.1967 | 04.12.2023 | 04.12.2023"
28659,1,1,1,1,-1,1,-1,-1,-1,-1,...,-1,-1,-1,-1,1,-1,1,0,1,1
28789,-1,1,1,-1,1,-1,1,1,1,1,...,1,1,1,1,1,1,-1,1,-1,1
28790,1,-1,1,1,-1,1,-1,-1,-1,-1,...,-1,-1,-1,-1,1,-1,1,-1,1,-1
28792,1,1,-1,1,1,1,1,1,1,1,...,1,1,1,1,-1,1,1,1,1,1
28793,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29249,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
29250,-1,1,1,-1,1,1,1,1,1,1,...,1,1,1,1,1,1,-1,1,-1,1
29251,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
29252,1,1,1,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


# Prediction of Individual Voting Behavior
We will now try to predict the voting behavior per member of parliament. To start, we will choose one member, and train a classifier for that person based on the available information of the proposal. Later on, we will build a classifier to predict the voting behavior of every person based on the characteristics of that person and on the available information of the proposal.

## Building a Member-Specific Classifier
We use the first member in the table (column index 0) as an example and train a model to predict that person's voting behavior. The choice is arbitrary; you can of course try any other member of parliament.

The cast votes of that person will be the target value we try to predict, which is typically denoted by `y`. We therefore call this target variable `Y_selected_member`:

In [12]:
Y_selected_member = pd.DataFrame(df_nr_cast_votes.iloc[:,0])
Y_selected_member

Reference ID,"4154 | Addor, Jean-Luc | NR | V | VS | 22.04.1964 | 04.12.2023 | 04.12.2023"
28659,1
28789,-1
28790,1
28792,1
28793,1
...,...
29249,1
29250,-1
29251,1
29252,1


In the following, we will try to predict these values first based only on the text columns (i.e. using the vectorized text data), and later also using the additional information. 

### Prediction based only on Text Columns
We first have to prepare the data such that we can train and evaluate the model. Therefore, we go through a typical machine learning pipeline:
- Split the dataset
- Train the model
- Evaluate the model

When splitting the dataset, we use the simplest way: randomly choose 20% of the whole dataset to be the test dataset. The splitting could be done in a more refined way.

**Optional Exercise:** What are the implications of this way of splitting the data? What alternatives could we use, with what benefits?

A purely random choice of 20% may introduce bias if certain topics are over- or underrepresented in the test dataset. A stratified split avoids this: we group the proposals by subject category and randomly select 20% within each category.
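The per-category sampling described above can be sketched with the `stratify` argument of `train_test_split`. This is only an illustration on made-up data: the `topic_category` labels are hypothetical and not part of the prepared data sets, so a real category would first have to be derived from the proposals.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 proposals with 5 word-count features each.
rng = np.random.default_rng(42)
toy_features = pd.DataFrame(rng.integers(0, 3, size=(100, 5)),
                            columns=[f"word_{i}" for i in range(5)])
toy_votes = pd.Series(rng.choice([-1, 1], size=100), name="Vote")

# Illustrative (made-up) subject category per proposal: 50 / 30 / 20.
topic_category = pd.Series(["finance"] * 50 + ["health"] * 30 + ["defence"] * 20)

# stratify=... keeps the category proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    toy_features, toy_votes, test_size=0.2, random_state=42,
    stratify=topic_category)

# Share of each category in the test set (mirrors the 50 / 30 / 20 split).
test_share = topic_category.loc[X_test.index].value_counts(normalize=True)
print(test_share)
```

Stratifying on the vote itself (`stratify=toy_votes`) would instead keep the yes/no balance equal in both subsets; which variant is preferable depends on the bias one is most worried about.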

Furthermore, to simplify things a bit, we will only consider proposals where the selected member voted yes or no, and will ignore cases of abstention or absence.

Technical note: `Y_selected_member` is a data frame. The comparison `Y_selected_member != 0` will again yield a data frame. In order to use the corresponding `True` and `False` values, we extract the `values` and flatten them into a one-dimensional array that we can then use for logical indexing of the rows.

In [13]:
proposals_where_selected_member_voted = (Y_selected_member != 0).values.ravel()
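A toy illustration of this masking step may help. The data and names below are made up; the point is only that the comparison yields a data frame, while a flat boolean array is the unambiguous form for selecting rows.

```python
import pandas as pd

# Made-up votes of one member for four proposals (0 = did not vote).
toy_votes = pd.DataFrame({"member": [1, 0, -1, 1]}, index=[101, 102, 103, 104])
toy_features = pd.DataFrame({"word_a": [1, 0, 2, 1]}, index=toy_votes.index)

mask_frame = toy_votes != 0              # still a DataFrame with one boolean column
mask_array = mask_frame.values.ravel()   # 1-D NumPy array of True/False values

# Logical indexing keeps only the rows where the mask is True.
voted_features = toy_features[mask_array]
print(voted_features.index.tolist())     # [101, 103, 104]
```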

Remember that the prediction will be done based on `df_nr_vectorized_text_info`. For both the features and the target value, we have to choose only the rows (proposals) where the selected member voted.

We use the well-known function `train_test_split` to get the training and test data:

In [16]:
# We split the dataset into training and test dataset.
X_selected_member_train, X_selected_member_test, Y_selected_member_train, Y_selected_member_test = \
  train_test_split(df_nr_vectorized_text_info[proposals_where_selected_member_voted],
                   Y_selected_member[proposals_where_selected_member_voted],
                   test_size=0.2,
                   random_state=42)

#### Logistic Regression

**Exercise:** Train a logistic regression model, and evaluate it. Comment on the result.

In [27]:
lr_classifier = LogisticRegression(max_iter=10000)

lr_classifier.fit(X_selected_member_train, Y_selected_member_train)
y_selected_member_pred = lr_classifier.predict(X_selected_member_train)

# accuracy_score expects the true labels first, then the predictions
accuracy = accuracy_score(Y_selected_member_train, y_selected_member_pred)
print(f"accuracy on training set = {accuracy}")

y_selected_member_pred_test = lr_classifier.predict(X_selected_member_test)

print(f"accuracy on test set = {accuracy_score(Y_selected_member_test, y_selected_member_pred_test)}")
report = classification_report(Y_selected_member_test, y_selected_member_pred_test)

print(f"\nClassification Report (Test data):\n{report}")


accuracy on training set = 0.9007633587786259
accuracy on test set = 0.6818181818181818

Classification Report (Test data):
              precision    recall  f1-score   support

          -1       0.71      0.77      0.74        39
           1       0.62      0.56      0.59        27

    accuracy                           0.68        66
   macro avg       0.67      0.66      0.66        66
weighted avg       0.68      0.68      0.68        66



  y = column_or_1d(y, warn=True)


### Function to Train and Evaluate Classifiers
As we will be evaluating several classifiers, we define a function to train and evaluate a model. It is very similar to the cell above:

In [20]:
def train_apply_eval_model_classification(model, X_train, y_train, X_test, y_test):
    """
    Train a given model on a training data set, and evaluate it on both the training and test data.

    Arguments:
    - model: the model to be evaluated
    - X_train: the training predictors
    - y_train: the true labels of the training data set
    - X_test: the predictors of the test data set
    - y_test: the true labels of the test data set
    """

    # Train the model.
    # Passing a data frame as target values triggers a warning in scikit-learn,
    # so we convert it to a 1-D array first.
    if isinstance(y_train, pd.DataFrame):
        model.fit(X_train, y_train.values.ravel())
    else:
        model.fit(X_train, y_train)
    
    # Make predictions
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    # Evaluate the model
    accuracy_train = accuracy_score(y_train, y_pred_train)
    accuracy_test = accuracy_score(y_test, y_pred_test)
    report = classification_report(y_test, y_pred_test)
    
    print(f'Accuracy Train: {accuracy_train}')
    print(f'Accuracy Test: {accuracy_test}')
    print(f"\nClassification Report (Test data):\n{report}")

### Predictions based on all Columns
Next, we will use all information (i.e., both the numerical and the text data) of the voting proposal to predict the voting behavior of our selected member of parliament. We already have this stored in the variable `df_nr_all_info`. As above, we need to make sure that we only use voting proposals where the selected person did vote either "yes" or "no".

**Exercise:** Split the full data into `X_selected_member_train`, `X_selected_member_test`, `Y_selected_member_train`, `Y_selected_member_test` using `train_test_split`:

In [25]:
# X_selected_member_train, X_selected_member_test, Y_selected_member_train, Y_selected_member_test = \
#   train_test_split( ... )

# The target is still the selected member's cast votes (first column of df_nr_cast_votes);
# only the features change, from the vectorized text to all proposal information.
Y_selected_member_all_info = pd.DataFrame(df_nr_cast_votes.iloc[:, 0])
proposals_where_selected_member_all_voted = (Y_selected_member_all_info != 0).values.ravel()

# We split the dataset into training and test dataset.
# Note: these *_all_* variables must be passed to a model to actually use the full feature set.
X_selected_member_all_train, X_selected_member_all_test, Y_selected_member_all_train, Y_selected_member_all_test = \
  train_test_split(df_nr_all_info[proposals_where_selected_member_all_voted],
                   Y_selected_member_all_info[proposals_where_selected_member_all_voted],
                   test_size=0.2,
                   random_state=42)


#### Logistic Regression
Again, we will run a logistic regression:

In [26]:
logistic_regression = LogisticRegression()

train_apply_eval_model_classification(logistic_regression, X_selected_member_train, Y_selected_member_train, 
                                      X_selected_member_test, Y_selected_member_test)

Accuracy Train: 0.9007633587786259
Accuracy Test: 0.6818181818181818

Classification Report (Test data):
              precision    recall  f1-score   support

          -1       0.71      0.77      0.74        39
           1       0.62      0.56      0.59        27

    accuracy                           0.68        66
   macro avg       0.67      0.66      0.66        66
weighted avg       0.68      0.68      0.68        66



**Exercise:** Comment on the above results. In particular, do you think we should add a regularisation (e.g., LASSO)? Or would you recommend another way to improve the performance?

**Exercise:** Implement a way to improve the results of the logistic regression classifier (but stay with this technique for now).

In the following cell, we get the coefficients of each of the attributes, and we sort the attributes along the coefficient value. 

We also compute the ***odds ratio***. Odds are another way to quantify chance. If we choose a random day, the **probability** that it is a weekend day is 2/7 (28.57%), and the probability that it is a work day is 5/7. The **odds** of a weekend day are the ratio of these two probabilities: (2/7) / (5/7) = 2/5, usually read as "2 to 5", because 2 equally likely outcomes (Saturday, Sunday) make the chosen day a weekend day and 5 outcomes (Monday, ..., Friday) make it a work day. Odds are sometimes written as a single number, here 0.4 (2/5); note that this number is not a probability. An **odds ratio** compares two odds. In logistic regression, the exponential of a coefficient is the odds ratio associated with that feature: the factor by which the odds of a YES vote are multiplied when the feature increases by one unit and all other features stay fixed.
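The relation between probability, odds, and the `Odds_Ratio` column computed below can be checked numerically. The following is a small sketch based on the weekend example; the coefficient value is hypothetical and only serves to show how `np.exp` of a coefficient acts on the odds.

```python
import numpy as np

# Probability and odds of picking a weekend day at random.
p_weekend = 2 / 7
odds_weekend = p_weekend / (1 - p_weekend)   # 2/5 = 0.4, i.e. "2 to 5"

def probability_from_odds(odds):
    """Convert odds back into a probability."""
    return odds / (1 + odds)

# Hypothetical logistic regression coefficient of some feature.
coefficient = 0.7
odds_ratio = np.exp(coefficient)   # factor applied to the odds per unit increase

# Increasing the feature by one unit multiplies the odds by the odds ratio.
new_odds = odds_weekend * odds_ratio
print(f"odds: {odds_weekend:.3f}, odds ratio: {odds_ratio:.3f}, "
      f"new probability: {probability_from_odds(new_odds):.3f}")
```

An odds ratio above 1 (positive coefficient) increases the odds, a value below 1 (negative coefficient) decreases them, and a value of exactly 1 leaves them unchanged.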

We then print the resulting data frame to see the attributes that are most and least in favor of a 1 (i.e., a vote YES):

In [None]:
# Get the coefficients
coefficients = logistic_regression.coef_[0]  # coef_ is a 2D array; for a binary problem it has a single row, which refers to the class listed last in classes_ (here 1, i.e. YES)

# Get the intercept (bias term)
intercept = logistic_regression.intercept_[0]

feature_names = X_selected_member_train.columns

# Create a DataFrame to display the coefficients with feature names
coeff_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients
})

coeff_df['Odds_Ratio'] = np.exp(coeff_df['Coefficient'])

# Sort by coefficient value: features most in favor of YES come first, those most in favor of NO come last
coeff_df = coeff_df.sort_values(by='Coefficient', ascending=False)

# Display the DataFrame
print(coeff_df)

**Exercise:** Interpret these results, and discuss possible limitations of this interpretation.

### Addressing Overfitting
**Exercise:** Modify the logistic regression code above to reduce overfitting. To do so, look at regularization methods. Note that for `LogisticRegression`, you might have to specify the solver. In particular, if you want to use logistic regression with an `l1` penalty, call

`LogisticRegression(penalty='l1', solver='liblinear')`

For more information, check https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html 
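As a starting point for this exercise, the sketch below combines the `l1` penalty with a cross-validated search over the regularization strength `C` (smaller `C` means stronger regularization). It runs on synthetic data from `make_classification`, not on the voting data, and the grid of `C` values is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in: few samples and many mostly irrelevant features.
X, y = make_classification(n_samples=300, n_features=100, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Cross-validated search over the inverse regularization strength C.
search = GridSearchCV(
    LogisticRegression(penalty='l1', solver='liblinear'),
    param_grid={'C': [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring='accuracy')
search.fit(X_train, y_train)

best_model = search.best_estimator_
n_nonzero = int((best_model.coef_ != 0).sum())   # the l1 penalty sets many coefficients to 0

print(f"best C: {search.best_params_['C']}")
print(f"non-zero coefficients: {n_nonzero} of {X.shape[1]}")
print(f"test accuracy: {best_model.score(X_test, y_test):.3f}")
```

Comparing training and test accuracy for each value of `C` shows how much of the gap between the two is closed by the regularization.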

#### Random Forest

In [None]:
random_forest = RandomForestClassifier(random_state=42)

train_apply_eval_model_classification(random_forest, X_selected_member_train, Y_selected_member_train, 
                                      X_selected_member_test, Y_selected_member_test)

**Exercise:** Comment on the above results. In particular, do you think we should add a regularisation? Or would you recommend another way to improve the performance?

**Optional Exercise:** Implement your recommended way to improve the performance.

## Building a Classifier for All Members of Parliament

Next, we will use the information available about the members of parliament in order to predict their voting behavior. While we will use the same information about the proposals as above, we will additionally use the information about the members of parliament to try to predict a given person's cast votes. In particular, the information we use about the members of parliament is the *parliamentary group* (*Fraktion* in German), the *canton* the person represents, and the age (which we derive from the date of birth).

### Data Preparation
The cells below perform this transformation. The details are rather technical and not necessary for the rest of the project.

In [None]:
Y_all_members_reset = df_nr_cast_votes.reset_index()

# create long format of data
Y_long_with_MPinfo = pd.melt(Y_all_members_reset, id_vars=['Reference ID'], 
                             var_name='Person', value_name='Vote')

# extract information about members in separate columns
person_split = Y_long_with_MPinfo['Person'].str.split('|', expand=True)
Y_long_with_MPinfo[['Person ID', 'Name', 'Chamber', 'Parl_Group', 'Canton', 'Birthday', 'Swear-in date 1',
                    'Swear-in date 2']] = person_split.apply(lambda x: x.str.strip())

Y_long_with_MPinfo.head()
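To see what the reshaping does, here is a reduced sketch with two members and two proposals. The column headers are shortened versions of the real ones (the two swear-in dates are left out), and `toy_votes`, `long_votes`, and `parts` are names introduced only for this illustration.

```python
import pandas as pd

# Wide format: one row per proposal, one column per member.
toy_votes = pd.DataFrame(
    {"4154 | Addor, Jean-Luc | NR | V | VS | 22.04.1964": [1, -1],
     "4049 | Aebischer, Matthias | NR | S | BE | 18.10.1967": [1, 1]},
    index=pd.Index([28659, 28789], name="Reference ID"))

# Long format: one row per (proposal, member) pair.
long_votes = pd.melt(toy_votes.reset_index(), id_vars=["Reference ID"],
                     var_name="Person", value_name="Vote")

# Split the header string at '|' and strip the surrounding spaces.
parts = long_votes["Person"].str.split("|", expand=True)
parts = parts.apply(lambda col: col.str.strip())
parts.columns = ["Person ID", "Name", "Chamber", "Parl_Group", "Canton", "Birthday"]

long_votes = pd.concat([long_votes.drop(columns="Person"), parts], axis=1)
print(long_votes)
```

Each of the 2 x 2 votes now has its own row, with the member's attributes repeated in separate columns.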

In [None]:
# transform the column "Birthday" to datetime format (the explicit format already fixes the day-first order)
Y_long_with_MPinfo['Birthday'] = pd.to_datetime(Y_long_with_MPinfo['Birthday'], format='%d.%m.%Y')

# Calculate the age of each member as of today (not as of the session date)
today = datetime.today()
Y_long_with_MPinfo['Age'] = Y_long_with_MPinfo['Birthday'].apply(lambda x: today.year - x.year - ((today.month, today.day) < (x.month, x.day)))

# Now we can drop the original "Person" column and other columns we don't need
Y_long_with_MPinfo.drop(columns=['Person', 'Name', 'Chamber', 'Birthday', 'Swear-in date 1', 'Swear-in date 2'], inplace=True)
Y_long_with_MPinfo.head()
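The age computation relies on a tuple comparison: one year is subtracted if the birthday has not yet occurred in the reference year. The sketch below isolates this logic in a hypothetical helper `age_on`; it also takes the reference day as an argument, which would make the feature reproducible, whereas the cell above uses the day on which the notebook is run. The reference dates are arbitrary examples.

```python
from datetime import date

def age_on(reference_day, birthday):
    """Age in completed years on reference_day."""
    # Tuples compare element by element: first the month, then the day.
    had_birthday_yet = (reference_day.month, reference_day.day) >= (birthday.month, birthday.day)
    return reference_day.year - birthday.year - (0 if had_birthday_yet else 1)

birthday = date(1964, 4, 22)
print(age_on(date(2024, 6, 14), birthday))   # 60: birthday already passed in 2024
print(age_on(date(2024, 3, 1), birthday))    # 59: birthday still to come in 2024
```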

Now, we have a table that contains the vote, the parliamentary group, the represented canton and the age of every member of parliament, for every voting proposal (`Reference ID`).

Next, we need to transform the categorical attributes `Parl_Group` and `Canton` into dummy variables.

In [None]:
Y_long_with_MPinfo_dummy = pd.get_dummies(Y_long_with_MPinfo, columns=['Parl_Group', 'Canton'], prefix=['PG', 'Canton']).astype(int)
Y_long_with_MPinfo_dummy

Now, we can combine `Y_long_with_MPinfo_dummy` with `df_nr_all_info` in order to get a table that contains all the information of the proposal, and all the information about the person, for every proposal and every person that voted yes or no. Note that this is a highly redundant data representation that is optimized for our prediction task.

In [None]:
df_nr_all_info_wMPinfo = pd.merge(Y_long_with_MPinfo_dummy, df_nr_all_info, left_on='Reference ID', right_on='Reference ID', how='left')
df_nr_all_info_wMPinfo

In [None]:
# we drop all rows that have a label = 0, which indicates that the member did not vote
rows_without_zero_label = (df_nr_all_info_wMPinfo['Vote'] != 0)
df_nr_all_info_wMPinfo = df_nr_all_info_wMPinfo[rows_without_zero_label]
df_nr_all_info_wMPinfo['Vote'].value_counts()

Next, we have to split our data into training and test data. 

**Exercise:** What could be different criteria, or different ways to do the splitting in this scenario? For each of the ways to split the data, what would be an application case?

We now split the data along the members of parliament, i.e., 20% of the persons will be selected, and all their votes will be used for the test dataset. All the remaining people and their votes are used as the training set.

In [None]:
unique_member_ids = df_nr_all_info_wMPinfo['Person ID'].unique()
train_ids, test_ids = train_test_split(unique_member_ids, test_size=0.2, random_state=42)

df_nr_all_info_wMPinfo_train = df_nr_all_info_wMPinfo[df_nr_all_info_wMPinfo['Person ID'].isin(train_ids)]
df_nr_all_info_wMPinfo_test  = df_nr_all_info_wMPinfo[df_nr_all_info_wMPinfo['Person ID'].isin(test_ids)]

X_nr_all_info_wMPinfo_train = df_nr_all_info_wMPinfo_train.drop(columns=['Vote'])
y_nr_all_info_wMPinfo_train = df_nr_all_info_wMPinfo_train['Vote']

X_nr_all_info_wMPinfo_test = df_nr_all_info_wMPinfo_test.drop(columns=['Vote'])
y_nr_all_info_wMPinfo_test = df_nr_all_info_wMPinfo_test['Vote']
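The same kind of member-wise split can also be expressed with scikit-learn's `GroupShuffleSplit`, which guarantees that all rows of one group end up on the same side of the split. The following is a sketch on a made-up long-format table, not a replacement for the cell above; the table `toy` and its contents are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy long-format table: 10 hypothetical members with 5 votes each.
toy = pd.DataFrame({
    "Person ID": np.repeat(np.arange(10), 5),
    "Reference ID": np.tile(np.arange(5), 10),
    "Vote": np.random.default_rng(0).choice([-1, 1], size=50),
})

# 20% of the groups (members) go to the test set, with all of their rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=toy["Person ID"]))

train_members = set(toy.iloc[train_idx]["Person ID"])
test_members = set(toy.iloc[test_idx]["Person ID"])
print(f"{len(train_members)} members in training, {len(test_members)} in test")
```

Passing `toy["Reference ID"]` as `groups` instead would split along the proposals, which corresponds to a different application: predicting votes on unseen proposals for known members.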

Now we are ready to train different models.

### Logistic Regression

In [None]:
logistic_regression = LogisticRegression(penalty=None, random_state=42, max_iter=1000)

train_apply_eval_model_classification(logistic_regression, X_nr_all_info_wMPinfo_train, y_nr_all_info_wMPinfo_train, 
                                      X_nr_all_info_wMPinfo_test, y_nr_all_info_wMPinfo_test)

**Exercise:** How could we improve the performance of this model? 

*Hint:* You might want to check the scale of the inputs to the regression model.
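One common way to handle inputs on very different scales is to standardize them before fitting, using the `StandardScaler` imported at the top of this notebook. The sketch below uses synthetic data with one unit-scale feature and one feature of ID-like magnitude; whether and how much this helps on the voting data is left for the exercise.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 400
small = rng.normal(size=n)                       # informative feature, unit scale
large = rng.normal(loc=2e7, scale=1e4, size=n)   # uninformative feature, huge magnitude
X = np.column_stack([small, large])
y = np.where(small + 0.3 * rng.normal(size=n) > 0, 1, -1)

X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

# The pipeline fits the scaler on the training data only and reuses
# the same mean and standard deviation for the test data.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_train, y_train)

test_accuracy = scaled_model.score(X_test, y_test)
print(f"test accuracy with scaling: {test_accuracy:.3f}")
```

Wrapping the scaler and the classifier in one pipeline keeps test-set statistics out of the training step, and the resulting object can be passed to `train_apply_eval_model_classification` like any other model.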

**Exercise**: Apply at least one other classification technique, and discuss the results in comparison to the result obtained using logistic regression.

# Regression

In this part, we use regression to predict the acceptance ratio (the share of yes votes) of a given proposal.

The target value is `Percent_Yes` in the dataframe `df_nr_all_info`; i.e. we will use all the available information about the proposal being voted on. We will not use any information about the members of parliament, because we only consider data from one session (Summer 2024), so the members of parliament do not change, and we discard the information about who participated in the vote.

* Try (define, train and evaluate) different regression methods
* Logistic regression might be an interesting option, as it yields a prediction result that fits into the target range 0...1 (0-100%)

## Data Preparation
We will derive the features and target value from `df_nr_all_info`.

* The target value, as already mentioned, is `df_nr_all_info['Percent_Yes']`.
* The features used are all other features of `df_nr_all_info` except `Percent_Yes`, and some features that are closely related to it, such as the number of yes and no votes.

In [None]:
Y_overall_4reg = df_nr_all_info['Percent_Yes']
Y_overall_4reg.head()

In [None]:
X_overall_4reg = df_nr_all_info.drop(columns=['Percent_Yes'])

X_overall_4reg

Next, we split the data into a training and test set. This split treats every proposal independently, which is somewhat of a simplification of the actual political process, as some proposals might depend on each other.

In [None]:
X_overall_4reg_train, X_overall_4reg_test, Y_overall_4reg_train, Y_overall_4reg_test = \
    train_test_split(X_overall_4reg, Y_overall_4reg, test_size=0.2, random_state=42)

## Evaluation of Regression Methods
In this section, we will compare different regression models.

* Given that we are predicting a continuous variable, linear regression seems to be the default starting point.
* As we are predicting a value between 0 and 1, the idea of logistic regression might sound appealing, as this would directly ensure that the output values are between 0 and 1.
* Finally, we try a more complex neural network to evaluate the performance of a model with more degrees of freedom.
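Note that scikit-learn's `LogisticRegression` is a classifier and rejects continuous targets, so the logistic idea needs a workaround. One possible sketch, shown here on synthetic data, is to apply the logit transform to the target, fit an ordinary linear regression, and map the predictions back with the sigmoid function. The helper names and the clipping threshold `eps` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def logit(p, eps=1e-3):
    # Shares of exactly 0 or 1 would map to -inf / +inf, so we clip first.
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Synthetic stand-in for proposal features and yes-shares in (0, 1).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_share = sigmoid(1.5 + X @ np.array([2.0, -1.0, 0.5]))

# Linear regression on the logit scale ...
model = LinearRegression().fit(X, logit(true_share))

# ... and back-transformation: predictions stay between 0 and 1,
# even for inputs far outside the training range.
predicted_share = sigmoid(model.predict(X * 5))
print(predicted_share.min(), predicted_share.max())
```

A plain `LinearRegression` on the untransformed target has no such guarantee and can predict shares below 0 or above 1.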

### Function for Model Training and Evaluation
As we will evaluate several models, we again define a function to summarize these steps:

In [None]:
def train_apply_eval_model_regression(model, X_train, y_train, X_test, y_test):
    """
    Train a given model on a training data set, and evaluate it on both the training and test data.

    Arguments:
    - model: the model to be evaluated
    - X_train: the training predictors
    - y_train: the true labels of the training data set
    - X_test: the predictors of the test data set
    - y_test: the true labels of the test data set
    """

    # If we have a Keras neural network, we first have to compile the model, and the fit method needs more arguments.
    if 'keras' in str(type(model)) and 'Sequential' in str(type(model)):
        # Compile the model, i.e. configure the optimizer, loss and metrics. This must be done before training.
        model.compile(
            optimizer='adam',
            loss='mean_squared_error',
            metrics=['mean_squared_error', 'r2_score']
            )

        # Train the model
        history = model.fit(
            X_train,
            y_train,
            epochs=20,
            batch_size=16,
            verbose=1
        )

    else:
        # we can just call 'fit' to train the model:
        model.fit(X_train, y_train)
    
    # Make predictions
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    # Evaluate the model
    mae_train = mean_absolute_error(y_train, y_pred_train)
    mae_test = mean_absolute_error(y_test, y_pred_test)
    r2_train = r2_score(y_train, y_pred_train)
    r2_test = r2_score(y_test, y_pred_test)
    
    print(f'MAE Train: {mae_train}')
    print(f'MAE Test: {mae_test}')    
    print(f'R2 Train: {r2_train}')
    print(f'R2 Test: {r2_test}')

Now we are ready to evaluate the performance of different regression models. 

### Linear Regression
As a first trial, we run a linear regression model:

In [None]:
linear_regression = LinearRegression()

train_apply_eval_model_regression(linear_regression, X_overall_4reg_train, Y_overall_4reg_train, X_overall_4reg_test, Y_overall_4reg_test)

**Exercise:** 

Comment on the results of the above cell. 

*Hint*: If you think it looks too good to be true, you are on the right track ;-) Can you spot the issue?

Fix the issue you spotted above, and re-run the regression. Comment on the result. 

**Exercises:**
How can you improve the performance of the linear regression? Use the techniques discussed in the course to do so, and discuss the results.

**Optional Exercises:**

Propose and implement one possible improvement over the standard linear regression. Identify a shortcoming of the solution of the linear regression above, and argue why / how your improvement could address it. Double-check your expectation with the result.

Two hints and ideas:

* We have introduced random forest classifiers in class. Random forests can also be used for regression, and `scikit-learn` provides a class `RandomForestRegressor` with the usual functions (`fit(...)`, `predict(...)`) that you know from other methods. For details, check https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
* (Advanced): The outputs are in a clearly defined range, but linear regression does not make use of this information. In `scikit-learn`, there is no function to take this information into account. However, you can build a neural network to do so. Start with a neural network for linear regression, and then extend this one. Experiment with different network architectures (number of layers, number of neurons, regularization) to find a good model. Comment on your findings.
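As a starting point for the first hint, the sketch below fits a `RandomForestRegressor` on synthetic data with a target between 0 and 1. The hyperparameters (`n_estimators=200`, `min_samples_leaf=3`) are arbitrary assumptions, not tuned values. A useful side effect for this task: a forest predicts averages of training targets, so its predictions never leave the range observed during training.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10 features, of which only the first two matter.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 10))
y = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))   # synthetic share in (0, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# min_samples_leaf > 1 acts as a simple form of regularization.
forest = RandomForestRegressor(n_estimators=200, min_samples_leaf=3, random_state=42)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)

print(f"MAE Test: {mean_absolute_error(y_test, y_pred):.3f}")
print(f"R2 Test: {r2_score(y_test, y_pred):.3f}")
```

Since the model exposes `fit` and `predict`, it can be passed directly to `train_apply_eval_model_regression` together with the voting data.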