# Examples and exercises for Lecture Adversarial Regularization Regimes for Classification Tasks

In [7]:
import os
from pathlib import Path

import pandas as pd
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
import numpy as np

from risk_learning.arr import (
    convert_to_categorical,
    make_feature_combination_array,
    make_feature_combination_score_array,
    make_trend_reports, 
    make_data_trend_reports
)

## Example Simpson's Paradox Data

In [8]:
datadir = Path(os.getcwd()) / 'data'
data_path = datadir / 'adversarial-default-for-x-validation.csv'

df = pd.read_csv(data_path)
df

Unnamed: 0,default,gender,occupation
0,0,0,1
1,1,0,0
2,1,1,1
3,0,0,0
4,0,1,1
...,...,...,...
595,0,0,0
596,0,0,1
597,1,0,0
598,1,0,0


In [9]:
label_mapping_values = dict(gender=[0, 1], occupation=[0, 1])
data_categories = label_mapping_values.copy()
data_categories['default'] = [0, 1]
df = convert_to_categorical(df, data_categories)
df.head(10)

Unnamed: 0,default,gender,occupation
0,0,0,1
1,1,0,0
2,1,1,1
3,0,0,0
4,0,1,1
5,1,0,0
6,0,0,0
7,1,0,0
8,0,0,0
9,1,0,0


## Exercise: Simpson or not?

Difficulty: (*)

Prove that this dataset exhibits Simpson's paradox.

In [13]:
# Share of rows by gender (group sizes, not default rates)
gender_share = df['gender'].value_counts(normalize=True)
print("Share of rows by gender:\n", gender_share)

# Share of rows by occupation (group sizes, not default rates)
occupation_share = df['occupation'].value_counts(normalize=True)
print("\nShare of rows by occupation:\n", occupation_share)

# Distribution of default within each (gender, occupation) cell
combined_default_rate = df.groupby(['gender', 'occupation'])['default'].value_counts(normalize=True)
print("\nDefault rate by combined gender and occupation:\n", combined_default_rate)

Share of rows by gender:
 gender
0    0.713333
1    0.286667
Name: proportion, dtype: float64

Share of rows by occupation:
 occupation
0    0.68
1    0.32
Name: proportion, dtype: float64

Default rate by combined gender and occupation:
 gender  occupation  default
0       0           1          0.770936
                    0          0.229064
        1           0          0.954545
                    1          0.045455
1       0           1          1.000000
                    0          0.000000
        1           0          0.723529
                    1          0.276471
Name: proportion, dtype: float64



The first two outputs are group sizes, not default rates: `value_counts` on `gender` and `occupation` shows that 71.33% of rows have gender 0 and 28.67% have gender 1, and that 68% of rows have occupation 0 and 32% have occupation 1.

The third output gives the default rate within each gender and occupation cell. For gender 0, the default rate is 77.09% in occupation 0 and 4.55% in occupation 1. For gender 1, it is 100% in occupation 0 and 27.65% in occupation 1. Within each occupation, gender 1 therefore defaults more often than gender 0 (100% vs. 77.09%, and 27.65% vs. 4.55%).

The aggregate rates point the other way. With 600 rows in the dataset, the proportions above imply cell sizes of 406, 22, 2 and 170 rows, so the overall default rate is 314/428 ≈ 73.4% for gender 0 and 49/172 ≈ 28.5% for gender 1.

Simpson's paradox? Yes. Gender 1 has the higher default rate within each occupation but the lower default rate overall. The reversal arises because gender 0 is concentrated in occupation 0, where default rates are high, while gender 1 is concentrated in occupation 1, where they are low.
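The reversal can be checked directly by comparing the aggregate default rate per gender with the rate per gender inside each occupation. The sketch below is self-contained: `demo` is a hypothetical stand-in for `df`, rebuilt from the cell counts that the printed proportions imply under the assumption of 600 rows. It is not the original CSV.

```python
import pandas as pd

# Cell counts reconstructed from the printed proportions (assumes 600 rows).
# Columns: gender, occupation, rows in the cell, defaults in the cell.
cells = [
    (0, 0, 406, 313),
    (0, 1, 22, 1),
    (1, 0, 2, 2),
    (1, 1, 170, 47),
]

rows = []
for gender, occupation, n_rows, n_default in cells:
    rows += [(gender, occupation, 1)] * n_default
    rows += [(gender, occupation, 0)] * (n_rows - n_default)
demo = pd.DataFrame(rows, columns=["gender", "occupation", "default"])

# Aggregate default rate per gender (the mean of a 0/1 column is a rate).
aggregate = demo.groupby("gender")["default"].mean()

# Default rate per gender inside each occupation: rows are occupations,
# columns are genders.
within = demo.groupby(["occupation", "gender"])["default"].mean().unstack("gender")

aggregate_gap = aggregate[1] - aggregate[0]   # gender 1 minus gender 0, overall
within_gap = within[1] - within[0]            # same gap, per occupation

# Simpson's paradox: the gap has one sign in every subgroup and the
# opposite sign in the aggregate.
simpson = bool((within_gap > 0).all() and aggregate_gap < 0)

print(aggregate.round(4))
print(within.round(4))
print("Simpson's paradox:", simpson)
```

On the real data, the same two `groupby` calls can be applied to `df` after casting `default` back to integers, since the mean is not defined for a categorical column.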

## Exercises: non-trivial regularization regime

* Which optimizer ("solver") for logistic regression seems best suited for the above dataset? https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html. Difficulty (*)
* Calculate the "true" trends for female default for each occupation subgroup. Note that sklearn uses the inverse regularization parameter $C$, so to approximate the usual unregularized case $\lambda=0$, set $C$ to a large value. Difficulty (**)
* Show that this dataset is adversarial for logistic regression for inverse regularization parameter $C=0.05$. Difficulty: (**)
* Show that this dataset is still adversarial for k-fold cross-validated logistic regression if $k=5$, the default setting.

1. The dataset is small (600 rows) and has only two binary features, so every solver should converge quickly. The scikit-learn documentation recommends liblinear for small datasets, and it supports both L1 and L2 penalties, which makes it a reasonable choice here. One caveat: liblinear also penalizes the intercept, which can matter when $C$ is small.
2. To calculate the "true" trends for female default for each occupation subgroup, we can use logistic regression with a large inverse regularization parameter (C) to approximate the usual λ = 0.

In [14]:
from sklearn.linear_model import LogisticRegression

# Set C to a large value to approximate λ = 0
C = 1e10

log_reg_female = LogisticRegression(solver='liblinear', C=C, max_iter=1000)

# Fit default ~ occupation on the female (gender == 1) rows only
female = df[df['gender'] == 1]
female_default_trends = log_reg_female.fit(female[['occupation']], female['default']).coef_
print("Female default trends by occupation:", female_default_trends)

Female default trends by occupation: [[-8.61436064]]


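The coefficient −8.61 is the change in the log-odds of default between occupation 1 and occupation 0 among the gender 1 (female) rows. It is negative because females in occupation 1 default less often (27.65%) than females in occupation 0 (100%). It says nothing about females relative to males, since the fit uses only female rows. Its magnitude should not be read literally: every female row in occupation 0 defaults, so the unregularized maximum-likelihood estimate is unbounded and the reported value depends on $C$ and the solver tolerance. The trend the exercise asks for, female versus male default within each occupation subgroup, requires fitting default on gender separately for each occupation.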
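One way to obtain the gender trend within each occupation subgroup is to fit default on gender separately per occupation, again with a very large $C$. The sketch below is self-contained and uses `demo`, a hypothetical stand-in for `df` rebuilt from the cell counts implied by the printed proportions (assuming 600 rows). The helper `gender_coef` is introduced here for illustration only.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Cell counts reconstructed from the printed proportions (assumes 600 rows).
cells = [(0, 0, 406, 313), (0, 1, 22, 1), (1, 0, 2, 2), (1, 1, 170, 47)]
rows = []
for gender, occupation, n_rows, n_default in cells:
    rows += [(gender, occupation, 1)] * n_default
    rows += [(gender, occupation, 0)] * (n_rows - n_default)
demo = pd.DataFrame(rows, columns=["gender", "occupation", "default"])


def gender_coef(frame, C=1e10):
    """Coefficient of gender in a logistic regression of default on gender."""
    model = LogisticRegression(solver="liblinear", C=C, max_iter=1000)
    model.fit(frame[["gender"]], frame["default"])
    return float(model.coef_[0, 0])


# Gender trend inside each occupation subgroup (approximately unregularized).
within = {occ: gender_coef(group) for occ, group in demo.groupby("occupation")}

# Gender trend when occupation is ignored.
marginal = gender_coef(demo)

print("Within occupation:", within)
print("Ignoring occupation:", marginal)
```

A positive coefficient means gender 1 has higher odds of default than gender 0 in that subgroup. The occupation 0 value is large and unstable for the same reason as above: both gender 1 rows in that cell default.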

3. To show that this dataset is adversarial for logistic regression at $C=0.05$, the fitted coefficients at that value would have to be compared with the "true" trends from item 2. The cell below instead tracks training accuracy over several values of $C$, none of which is 0.05.

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Train logistic regression models with different C values
C_values = [1e10, 1e5, 1e2, 1e0]
accuracies = []

for C in C_values:
    log_reg = LogisticRegression(solver='liblinear', C=C, max_iter=1000)
    log_reg.fit(df[['gender', 'occupation']], df['default'])
    y_pred = log_reg.predict(df[['gender', 'occupation']])
    accuracy = accuracy_score(df['default'], y_pred)
    accuracies.append(accuracy)

print("Accuracies for different C values:", accuracies)

Accuracies for different C values: [0.765, 0.765, 0.765, 0.765]


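The accuracy is 0.765 for every value of $C$. This is expected rather than unusual: with two binary features there are only four distinct inputs, and predicting the majority class in each cell gives (313 + 21 + 2 + 123)/600 = 0.765, using the cell counts implied by the proportions printed earlier. That is the highest accuracy any classifier on these two features can reach on this data. Accuracy therefore cannot reveal whether regularization distorts the fitted trends; for that, the coefficients themselves have to be inspected, including at $C=0.05$.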
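A check that looks at coefficients instead of accuracy could be sketched as follows. It fits default on both features for a nearly unregularized $C$ and for $C=0.05$, and reports the gender coefficient each time. `demo` is a hypothetical stand-in for `df` rebuilt from the implied cell counts (assuming 600 rows), and `fit_coefs` is an illustrative helper. No expected sign at $C=0.05$ is claimed here; that has to be read off the actual output.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Cell counts reconstructed from the printed proportions (assumes 600 rows).
cells = [(0, 0, 406, 313), (0, 1, 22, 1), (1, 0, 2, 2), (1, 1, 170, 47)]
rows = []
for gender, occupation, n_rows, n_default in cells:
    rows += [(gender, occupation, 1)] * n_default
    rows += [(gender, occupation, 0)] * (n_rows - n_default)
demo = pd.DataFrame(rows, columns=["gender", "occupation", "default"])

features = ["gender", "occupation"]


def fit_coefs(C):
    """Fit default ~ gender + occupation and return the coefficients by name."""
    model = LogisticRegression(solver="liblinear", C=C, max_iter=1000)
    model.fit(demo[features], demo["default"])
    return dict(zip(features, (float(c) for c in model.coef_[0])))


# 1e10 approximates no regularization; 0.05 is the value named in the exercise.
coefs = {C: fit_coefs(C) for C in [1e10, 1.0, 0.05]}

for C, c in coefs.items():
    print(f"C={C:g}: gender={c['gender']:+.3f}, occupation={c['occupation']:+.3f}")
```

If the gender coefficient at $C=0.05$ has the opposite sign to the nearly unregularized one, the regularized model reports the aggregate trend and not the within-occupation trend.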

4. To show that this dataset is still adversarial for k-fold cross-validated logistic regression, we can use the LogisticRegressionCV class from scikit-learn.

In [16]:
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import KFold

# 5-fold cross-validation; Cs holds a single candidate, so there is nothing for CV to select
kf = KFold(n_splits=5, shuffle=True, random_state=42)
log_reg_cv = LogisticRegressionCV(Cs=[1e10], cv=kf, solver='liblinear', max_iter=1000)

# Fit, then score the refit model on the full training data
log_reg_cv.fit(df[['gender', 'occupation']], df['default'])
print("Accuracy on the training data:", log_reg_cv.score(df[['gender', 'occupation']], df['default']))

Accuracy on the training data: 0.765


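The score 0.765 is the accuracy of the refit model on the full training data, not a cross-validated estimate, and it equals the value from the previous cell. Because `Cs` contains only `1e10`, cross-validation has no candidates to choose between. The exercise calls for $k=5$ with a grid of candidate values, followed by an inspection of the selected `C_` and the resulting coefficients.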
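A version that lets cross-validation actually choose $C$ could look like the sketch below. It uses `Cs=10`, which scikit-learn expands to a logarithmic grid of ten candidates, and `cv=5`. `demo` is again a hypothetical stand-in for `df` rebuilt from the implied cell counts (assuming 600 rows), so the selected value may differ on the real CSV.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV

# Cell counts reconstructed from the printed proportions (assumes 600 rows).
cells = [(0, 0, 406, 313), (0, 1, 22, 1), (1, 0, 2, 2), (1, 1, 170, 47)]
rows = []
for gender, occupation, n_rows, n_default in cells:
    rows += [(gender, occupation, 1)] * n_default
    rows += [(gender, occupation, 0)] * (n_rows - n_default)
demo = pd.DataFrame(rows, columns=["gender", "occupation", "default"])

X = demo[["gender", "occupation"]]
y = demo["default"]

# Cs=10 asks for a grid of ten candidate values; cv=5 uses five folds.
cv_model = LogisticRegressionCV(Cs=10, cv=5, solver="liblinear", max_iter=1000)
cv_model.fit(X, y)

# C_ holds the selected value; ravel() keeps this robust to its exact shape.
selected_C = float(np.ravel(cv_model.C_)[0])
gender_coef = float(cv_model.coef_[0, 0])

print("Candidate Cs:", cv_model.Cs_)
print("Selected C:", selected_C)
print("Gender coefficient at the selected C:", gender_coef)
```

Because accuracy is nearly insensitive to $C$ on this data, the selected value can land anywhere on the grid, which is why the coefficient at the selected $C$ is the quantity to inspect.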

Overall, these experiments do not yet establish that the dataset is adversarial for logistic regression. Constant accuracy across $C$ follows from the structure of the data (four distinct inputs with a clear majority class in each), not from adversarial behavior. What the data does exhibit is Simpson's paradox: the aggregate gender trend has the opposite sign to the gender trend within each occupation. Showing the adversarial property would mean checking whether the gender coefficient at $C=0.05$, and at the $C$ selected by 5-fold cross-validation, follows the aggregate trend instead of the within-occupation one.