# HW2 - Bias in Data and Prediction - DSCI 531 - Spring 2024

### Please complete the code or analysis under "TODO". 100pts in total. You should run every cell and keep all the outputs before submitting. Failing to include your outputs will result in zero points.
### Please keep academic integrity in mind. Plagiarism will be taken seriously.

In [1]:
import numpy as np
import pandas as pd

## 1. Implement Utility Functions

### 1.1 Fairness Metrics

In [2]:
# You are NOT allowed to use off-the-shelf fairness packages like aif360

def stat_parity(preds, sens):
    '''
    :preds: numpy array of the model predictions. Consisting of 0s and 1s
    :sens: numpy array of the sensitive features. Consisting of 0s and 1s
    :return: the statistical parity. no need to take the absolute value
    '''
    
    count_one = 0 # Positive predictions where sens == 1
    count_zero = 0 # Positive predictions where sens == 0
    num_one = sum(sens) # number of class 1 of sens
    num_zero = len(sens) - num_one # number of class 0 of sens
    
    for i in range(len(sens)):
        if sens[i]==0: # case of 0 of sens
            if preds[i]==1:
                count_zero +=1
            else:
                continue
        else: # case of 1 of sens
            if preds[i]==1:
                count_one +=1
            else:
                continue
                
                
    if num_one != 0:
        
        parity1 = count_one/num_one
    else:
        parity1 =0 
        
    if num_zero != 0:
        
        parity0 = count_zero/num_zero
    else:
        parity0 =0 
        
    parity = parity1-parity0
    # TODO. 7.5pts
    return parity


def eq_oppo(preds, sens, labels):
    '''
    :preds: numpy array of the model predictions. Consisting of 0s and 1s
    :sens: numpy array of the sensitive features. Consisting of 0s and 1s
    :labels: numpy array of the ground truth labels of the outcome. Consisting of 0s and 1s
    :return: the equal opportunity difference (true positive rate of group 1 minus that of group 0). no need to take the absolute value
    '''
    
    TP_one = np.sum((preds == 1) & (sens == 1) & (labels == 1))
    AP_one = np.sum((sens == 1) & (labels == 1)) # this will include False Negative because FN is actually positive

    TP_zero = np.sum((preds == 1) & (sens == 0) & (labels == 1))
    AP_zero = np.sum((sens == 0) & (labels == 1))
    
    if AP_one != 0:
        
        eq_one = TP_one/AP_one
    else:
        eq_one = 0
        
    if AP_zero!=0:
        
        eq_zero = TP_zero/AP_zero
    else:
        eq_zero = 0
    
    # TODO. 7.5pts
    return eq_one-eq_zero

In [3]:
# Test your implemented fairness metrics using the code below
# Don't change the code in this cell

# test case 1
preds = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 1])
sens = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])
labels = np.array([0, 1, 0, 1, 0, 1, 1, 1, 0, 1])
print(eq_oppo(preds, sens, labels), stat_parity(preds, sens))

# test case 2
preds = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 1])
sens = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1])
labels = np.array([0, 1, 0, 1, 0, 1, 1, 0, 0, 0])
print(eq_oppo(preds, sens, labels), stat_parity(preds, sens))


# test case 3
preds = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1])
sens = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1])
labels = np.array([0, 1, 0, 1, 0, 1, 1, 0, 0, 0])
print(eq_oppo(preds, sens, labels), stat_parity(preds, sens))

0.4 -0.125
-0.75 0.5
0.0 1.0
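
As a cross-check on the loop-based implementation above, both metrics can also be written with boolean masks. This is only a sketch: the names `group_rate`, `stat_parity_vec` and `eq_oppo_vec` are hypothetical helpers, not part of the assignment, and an empty group is given a rate of 0 to match the behaviour of `stat_parity` and `eq_oppo`.

```python
import numpy as np

def group_rate(values, mask):
    # Mean of `values` over the rows selected by `mask`; 0.0 if the group is empty
    return float(values[mask].mean()) if mask.any() else 0.0

def stat_parity_vec(preds, sens):
    # P(pred = 1 | sens = 1) - P(pred = 1 | sens = 0)
    return group_rate(preds, sens == 1) - group_rate(preds, sens == 0)

def eq_oppo_vec(preds, sens, labels):
    # Difference in true positive rate: restrict both groups to labels == 1
    pos = labels == 1
    return group_rate(preds, (sens == 1) & pos) - group_rate(preds, (sens == 0) & pos)

# test case 1 from the cell above
preds = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 1])
sens = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])
labels = np.array([0, 1, 0, 1, 0, 1, 1, 1, 0, 1])
print(eq_oppo_vec(preds, sens, labels), stat_parity_vec(preds, sens))
```

On test case 1 this should agree with the `0.4 -0.125` printed above, up to floating-point rounding.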


### 1.2 Preprocessing DataFrame

In [36]:
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler 
def process_dfs(df_train_x, df_test_x, categ_cols):
    '''
    Pre-process the features of the training set and the test set, not including the outcome column.
    Convert categorical features (nominal & ordinal features) to one-hot encodings.
    Normalize the numerical features into [0, 1].
    We process training set and the test set together in order to make sure that 
    the encodings are consistent between them.
    For example, if one class is encoded as 001 and another class is encoded as 010 in the training set,
    you should follow this mapping for the test set too.
    
    :df_train_x: the dataframe of the training features
    :df_test_x: the dataframe of the test features
    :categ_cols: the column names of the categorical features. the remaining features are treated as numerical ones.
    :return: the processed training data and test data, both should be numpy arrays, instead of DataFrames
    '''
    
    
    train_x = pd.get_dummies(df_train_x, columns = categ_cols, dtype=int)
    test_x = pd.get_dummies(df_test_x, columns = categ_cols, dtype=int)
    
    test_x = test_x.reindex(columns=train_x.columns, fill_value=0)
    
    
    
    scaler = MinMaxScaler()
    
    train_x = scaler.fit_transform(train_x)
    test_x = scaler.transform(test_x) # use the same scaler
        
    
    # TODO. 15pts
    return train_x, test_x

In [37]:
# Test your implemented data preprocessing function
# DO NOT change the code in this cell

df_train_x = pd.DataFrame([
    [ 'big', 10, 'blue',],
    [ 'big', 12, 'red',],
    ['medium', 5, 'blue'],
    ['small', 7, 'yellow']
], columns=['size', 'height', 'color'])

df_test_x = pd.DataFrame([
    [ 'big', 16, 'red',],
    ['small', 9, 'blue']
], columns=['size', 'height', 'color'])

train_data_x, test_data_x = process_dfs(df_train_x, df_test_x, categ_cols=['size', 'color'])
print(train_data_x)
print()
print(test_data_x)

[[0.71428571 1.         0.         0.         1.         0.
  0.        ]
 [1.         1.         0.         0.         0.         1.
  0.        ]
 [0.         0.         1.         0.         1.         0.
  0.        ]
 [0.28571429 0.         0.         1.         0.         0.
  1.        ]]

[[1.57142857 1.         0.         0.         0.         1.
  0.        ]
 [0.57142857 0.         0.         1.         1.         0.
  0.        ]]


In [38]:
test_data_x.shape

(2, 7)

In [39]:
train_data_x.shape

(4, 7)
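
The cell in 1.2 imports `OneHotEncoder` but builds the encodings with `pd.get_dummies` plus `reindex`. A possible alternative, sketched below under assumptions, fits `OneHotEncoder` on the training set only and reuses it for the test set. The function name `process_dfs_sklearn` and the toy `'green'` category are hypothetical; the sketch also assumes there is at least one numerical column, and it places numerical columns first, which matches `process_dfs` on the toy data but is not guaranteed for every column layout.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

def process_dfs_sklearn(df_train_x, df_test_x, categ_cols):
    # Columns not listed as categorical are treated as numerical
    num_cols = [c for c in df_train_x.columns if c not in categ_cols]

    # Categories unseen during fit are encoded as all zeros instead of raising
    enc = OneHotEncoder(handle_unknown='ignore')
    train_cat = enc.fit_transform(df_train_x[categ_cols]).toarray()
    test_cat = enc.transform(df_test_x[categ_cols]).toarray()

    # Fit the scaler on the training set only, then reuse it on the test set
    scaler = MinMaxScaler()
    train_num = scaler.fit_transform(df_train_x[num_cols])
    test_num = scaler.transform(df_test_x[num_cols])

    return np.hstack([train_num, train_cat]), np.hstack([test_num, test_cat])

df_tr = pd.DataFrame([
    ['big', 10, 'blue'],
    ['big', 12, 'red'],
    ['medium', 5, 'blue'],
    ['small', 7, 'yellow'],
], columns=['size', 'height', 'color'])

# 'green' never appears in the training set
df_te = pd.DataFrame([
    ['big', 16, 'red'],
    ['small', 9, 'green'],
], columns=['size', 'height', 'color'])

tr, te = process_dfs_sklearn(df_tr, df_te, categ_cols=['size', 'color'])
print(tr.shape, te.shape)
print(te)
```

As in the output above, a scaled test value can fall outside [0, 1] (here the height 16 exceeds the training maximum of 12), because the scaler is fit on the training set only.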

## 2. Load Data

In [40]:
df_train_adult = pd.read_csv('adult-train.csv', sep=', ', engine='python')
df_test_adult = pd.read_csv('adult-test.csv', sep=', ', engine='python')
df_train_adult['sex'] = df_train_adult['sex'].map({'Male': 0, 'Female': 1})
df_test_adult['sex'] = df_test_adult['sex'].map({'Male': 0, 'Female': 1})
df_train_adult['income'] = df_train_adult['income'].map({'<=50K': 0, '>50K': 1})
df_test_adult['income'] = df_test_adult['income'].map({'<=50K': 0, '>50K': 1})


df_train_german = pd.read_csv('german-train.csv')
df_test_german = pd.read_csv('german-test.csv')
df_train_german['age'] = df_train_german['age'].apply(lambda x: 1 if x >= 33 else 0)
df_test_german['age'] = df_test_german['age'].apply(lambda x: 1 if x>=33 else 0)
df_train_german['credit_status'] = df_train_german['credit_status'].map({2:0, 1:1})
df_test_german['credit_status'] = df_test_german['credit_status'].map({2:0, 1:1})

In [41]:
df_train_adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,0,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,0,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,0,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,0,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,1,0,0,40,Cuba,0


In [42]:
df_train_german.head()

Unnamed: 0,checking_account,duration,credit_history,purpose,credit_amount,savings_account,present_employment_since,installment_rate,personal_status_sex,other_debtors,...,property,age,other_installment_plans,housing,num_credits,job,num_people_liable,telephone,foreign_worker,credit_status
0,A14,21,A32,A41,5248,A65,A73,1,A93,A101,...,A123,0,A143,A152,1,A173,1,A191,A201,1
1,A11,24,A32,A43,1987,A61,A73,2,A93,A101,...,A121,0,A143,A151,1,A172,2,A191,A201,0
2,A14,36,A32,A49,5742,A62,A74,2,A93,A101,...,A123,0,A143,A152,2,A173,1,A192,A201,1
3,A14,36,A32,A49,7409,A65,A75,3,A93,A101,...,A122,1,A143,A152,2,A173,1,A191,A201,1
4,A14,6,A34,A42,1221,A65,A73,1,A94,A101,...,A122,0,A143,A152,2,A173,1,A191,A201,1


## 3. Explore fairness in data

### 3.1 Statistical analysis on protected feature and outcome

In [46]:
# Adult
# calculate the mean income of two protected groups. only use the training data df_train_adult. 
# TODO. 3pts. The starter code below just indicates what you need to output in your code.
mean_income1_adult = np.mean(df_train_adult.loc[df_train_adult['sex']==0]['income'])
mean_income2_adult = np.mean(df_train_adult.loc[df_train_adult['sex']==1]['income'])

print(mean_income1_adult, mean_income2_adult)


# German
# calculate the mean credit status of two protected groups. only use the training data df_train_german. 
# TODO. 3pts. The starter code below just indicates what you need to output in your code.
mean_credit1_german = np.mean(df_train_german.loc[df_train_german['age']==0]['credit_status'])
mean_credit2_german = np.mean(df_train_german.loc[df_train_german['age']==1]['credit_status'])

print(mean_credit1_german, mean_credit2_german)

0.3138370951913641 0.11367818442036394
0.6636363636363637 0.7594594594594595


In [52]:
# t-test between outcome of two protected groups. only use the training data df_train_adult/german.
from scipy.stats import ttest_ind_from_stats

# Adult
# TODO. 1.5pts. The starter code below just indicates what you need to output in your code.
std_income1_adult = np.std(df_train_adult.loc[df_train_adult['sex']==0]['income'])
std_income2_adult = np.std(df_train_adult.loc[df_train_adult['sex']==1]['income'])
num_income1_adult = len(df_train_adult.loc[df_train_adult['sex']==0]['income'])
num_income2_adult = len(df_train_adult.loc[df_train_adult['sex']==1]['income'])
# p_value_adult = 

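# Note: ttest_ind_from_stats expects sample standard deviations (ddof=1),
# while np.std on a pandas Series uses ddof=0, so the std values passed below
# are slightly too small. With these sample sizes the effect on the p-value is minor.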

t_stat_adult, p_value_adult = ttest_ind_from_stats(mean1=mean_income1_adult, std1=std_income1_adult, nobs1=num_income1_adult,
                                       mean2=mean_income2_adult, std2=std_income2_adult, nobs2=num_income2_adult,
                                       equal_var=True)


# # german
# # TODO. 1.5pts. The starter code below just indicates what you need to output in your code.
# p_value_german = 

std_credit1_german = np.std(df_train_german.loc[df_train_german['age']==0]['credit_status'])
std_credit2_german = np.std(df_train_german.loc[df_train_german['age']==1]['credit_status'])
num_credit1_german = len(df_train_german.loc[df_train_german['age']==0]['credit_status'])
num_credit2_german = len(df_train_german.loc[df_train_german['age']==1]['credit_status'])

t_stat_german, p_value_german = ttest_ind_from_stats(mean1=mean_credit1_german, std1=std_credit1_german, nobs1=num_credit1_german,
                                       mean2=mean_credit2_german, std2=std_credit2_german, nobs2=num_credit2_german,
                                       equal_var=True)
print(p_value_adult, p_value_german)

0.0 0.0049802626425791505
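
As a sanity check on the summary-statistics route, `ttest_ind` can be run on the raw samples directly. The sketch below uses synthetic binary outcomes rather than the Adult or German data (the group sizes, rates and seed are assumptions for illustration) and shows that the two routes agree when the standard deviations are computed with `ddof=1`.

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_ind_from_stats

# Synthetic binary outcomes for two groups (illustrative values only)
rng = np.random.default_rng(0)
group_a = rng.binomial(1, 0.31, size=500)
group_b = rng.binomial(1, 0.11, size=300)

# Route 1: t-test on the raw samples
t_raw, p_raw = ttest_ind(group_a, group_b, equal_var=True)

# Route 2: t-test from summary statistics, using sample std (ddof=1)
t_sum, p_sum = ttest_ind_from_stats(
    mean1=group_a.mean(), std1=group_a.std(ddof=1), nobs1=len(group_a),
    mean2=group_b.mean(), std2=group_b.std(ddof=1), nobs2=len(group_b),
    equal_var=True,
)

print(t_raw, p_raw)
print(t_sum, p_sum)
```

Passing `equal_var=False` to either function would give Welch's t-test instead, which does not assume equal group variances.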


### From the p_values, are the results significant for Adult and German? How do you explain them?
### <span style="color:red">Please type your response here.</span> 3 pts
Both p-values (approximately 0 for Adult and about 0.005 for German) are below 0.05, so the difference in mean outcome between the two protected groups is statistically significant in both datasets. We reject the null hypothesis that the two groups have the same mean outcome.

For Adult, the groups are defined by sex: about 31.4% of males versus 11.4% of females in the training data have income >50K, which suggests the data are biased with respect to sex. For German, the groups are defined by age: about 66.4% of applicants younger than 33 versus 75.9% of those aged 33 or older have credit_status = 1, which suggests the data may be biased with respect to age.

### 3.2 Explore Fairness in Prediction

In [53]:
# Prepare data
# Don't change code in this cell

'''
:train_x: the features in the training set (including the sensitive features), shape: N_train x d
:train_y: the outcome in the training set, shape: N_train
:test_x: the features in the test set (including the sensitive features), shape: N_test x d
:test_y: the outcome in the test set, shape: N_test
:test_sens: the sensitive/protected feature in the test set, shape: N_test
All of them are processed numpy arrays that are ready for algorithms.
'''


# adult
# the outcome (income) is the last column
df_train_x_adult = df_train_adult.iloc[:, :-1]
df_train_y_adult = df_train_adult.iloc[:, -1]
df_test_x_adult = df_test_adult.iloc[:, :-1]
df_test_y_adult = df_test_adult.iloc[:, -1]
df_test_sens_adult = df_test_adult['sex']

train_x_adult, test_x_adult = process_dfs(df_train_x_adult, df_test_x_adult, 
                                                   ['workclass', 'education','marital-status',
                                                    'occupation','relationship','race',
                                                    'native-country'])
train_y_adult = df_train_y_adult.values
test_y_adult = df_test_y_adult.values
test_sens_adult = df_test_sens_adult.values

# german
# the outcome (credit status) is the last column
df_train_x_german = df_train_german.iloc[:, :-1]
df_train_y_german = df_train_german.iloc[:, -1]
df_test_x_german = df_test_german.iloc[:, :-1]
df_test_y_german = df_test_german.iloc[:, -1]
df_test_sens_german = df_test_german['age']

train_x_german, test_x_german = process_dfs(df_train_x_german, df_test_x_german,
                                                     ['checking_account', 'credit_history', 
                                                      'purpose', 'savings_account', 'present_employment_since', 
                                                      'personal_status_sex', 'other_debtors',
                                                     'property', 'other_installment_plans',
                                                     'housing', 'job', 'telephone', 'foreign_worker'])
train_y_german = df_train_y_german.values
test_y_german = df_test_y_german.values
test_sens_german = df_test_sens_german.values

print(train_x_adult.shape, test_x_adult.shape, train_y_adult.shape, test_y_adult.shape)
print(train_x_german.shape, test_x_german.shape, train_y_german.shape, test_y_german.shape)

(30162, 103) (15060, 103) (30162,) (15060,)
(700, 61) (300, 61) (700,) (300,)


In [54]:
# train a classifier to predict the outcome y from features x
# training: train_x --> train_y; test: test_x --> preds
# logistic regression model is recommended
# using sklearn is allowed


# Adult 5 pts

# initialize the model
# TODO.
from sklearn.linear_model import LogisticRegression

# train/fit the model with train_x_adult and train_y_adult
# TODO.
model1 = LogisticRegression()
model1.fit(train_x_adult,train_y_adult)


# predict the outcome from test_x_adult
# TODO. The starter code below just indicates what you need to output in your code.
preds = model1.predict(test_x_adult)


# report acc and two fairness metrics. 
from sklearn.metrics import accuracy_score
acc = accuracy_score(test_y_adult, preds)
stat_p = stat_parity(preds, test_sens_adult)
eq_op = eq_oppo(preds, test_sens_adult, test_y_adult)
print(acc, stat_p, eq_op)





# German 5 pts

# initialize the model
# TODO.


# train/fit the model with train_x_german and train_y_german
# TODO.
model2 = LogisticRegression()
model2.fit(train_x_german,train_y_german)

# predict the outcome from test_x_german
# TODO. The starter code below just indicates what you need to output in your code.
preds = model2.predict(test_x_german)


# report acc and two fairness metrics
from sklearn.metrics import accuracy_score
acc = accuracy_score(test_y_german, preds)
stat_p = stat_parity(preds, test_sens_german)
eq_op = eq_oppo(preds, test_sens_german, test_y_german)
print(acc, stat_p, eq_op)

0.8458167330677291 -0.18351302412645248 -0.1010069968257522
0.76 0.09657196211818064 0.08974358974358976


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
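
The warning above means the default lbfgs solver stopped at its iteration cap before converging. A minimal sketch of raising `max_iter` follows; it uses a synthetic dataset from `make_classification` rather than the Adult or German data, and the value 1000 is an assumption chosen for illustration. Rerunning the cells above with a larger `max_iter` may change the reported metrics slightly.

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the homework datasets)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Allow more iterations than the default of 100
model = LogisticRegression(max_iter=1000)

# Turn a ConvergenceWarning into an error so non-convergence cannot go unnoticed
with warnings.catch_warnings():
    warnings.simplefilter("error", category=ConvergenceWarning)
    model.fit(X, y)

# Number of iterations actually used by the solver
print(model.n_iter_)
```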


## 4. Explore possible ways to mitigate bias

### 4.1 Remove protected attribute

In [67]:
# Adult
# remove the sex column from df_train_x_adult and df_test_x_adult. 
# You shouldn't do it in-place. In other words, do not modify df_train_x_adult or df_test_x_adult
# TODO. 4pts. The starter code below just indicates what you need to output in your code.
df_train_x_no_sens_adult = df_train_x_adult.loc[:, ~df_train_x_adult.columns.isin(['sex'])]
df_test_x_no_sens_adult = df_test_x_adult.loc[:, ~df_test_x_adult.columns.isin(['sex'])]


train_x_adult, test_x_adult = process_dfs(df_train_x_no_sens_adult, df_test_x_no_sens_adult, 
                                                   ['workclass', 'education','marital-status',
                                                    'occupation','relationship','race',
                                                    'native-country'])


# German
# remove age column from df_train_x_german and df_test_x_german
# You shouldn't do it in-place. In other words, do not modify df_train_x_german or df_test_x_german
# TODO. 4pts. The starter code below just indicates what you need to output in your code.
df_train_x_no_sens_german = df_train_x_german.loc[:, ~df_train_x_german.columns.isin(['age'])]
df_test_x_no_sens_german = df_test_x_german.loc[:, ~df_test_x_german.columns.isin(['age'])]


train_x_german, test_x_german = process_dfs(df_train_x_no_sens_german, df_test_x_no_sens_german,
                                                     ['checking_account', 'credit_history', 
                                                      'purpose', 'savings_account', 'present_employment_since', 
                                                      'personal_status_sex', 'other_debtors',
                                                     'property', 'other_installment_plans',
                                                     'housing', 'job', 'telephone', 'foreign_worker'])


print(train_x_adult.shape, test_x_adult.shape)
print(train_x_german.shape, test_x_german.shape)

(30162, 102) (15060, 102)
(700, 60) (300, 60)


In [68]:
# train a classifier to predict the outcome y from features x (with protected feature removed)
# training: train_x --> train_y; test: test_x --> preds
# logistic regression model is recommended
# using sklearn is allowed
# Just use the code in 3.2 again


# Adult 4 pts

# initialize the model
# TODO.
from sklearn.linear_model import LogisticRegression
model3 = LogisticRegression()
# train/fit the model with train_x_adult and train_y_adult
# TODO.

model3.fit(train_x_adult,train_y_adult)

# predict the outcome from test_x_adult
# TODO. The starter code below just indicates what you need to output in your code.
preds = model3.predict(test_x_adult)

# report acc and two fairness metrics
from sklearn.metrics import accuracy_score
acc = accuracy_score(test_y_adult, preds)
stat_p = stat_parity(preds, test_sens_adult)
eq_op = eq_oppo(preds, test_sens_adult, test_y_adult)
print(acc, stat_p, eq_op)



# German 4 pts

# initialize the model
# TODO.
model4 = LogisticRegression()

# train/fit the model with train_x_german and train_y_german
# TODO.
model4.fit(train_x_german,train_y_german)

# predict the outcome from test_x_german
# TODO. The starter code below just indicates what you need to output in your code.
preds = model4.predict(test_x_german)

# report acc and two fairness metrics
from sklearn.metrics import accuracy_score
acc = accuracy_score(test_y_german, preds)
stat_p = stat_parity(preds, test_sens_german)
eq_op = eq_oppo(preds, test_sens_german, test_y_german)
print(acc, stat_p, eq_op)

0.8457503320053121 -0.17481028073158084 -0.0711906599316483
0.7666666666666667 0.08323329331732687 0.07932692307692313


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### According to the results, how are the accuracy, stat parity and eq oppo different from the original model? Does explicitly removing the sensitive feature help in mitigating bias? Why or why not?
### <span style="color:red">Please type your response here.</span> 4 points
Accuracy is essentially unchanged after removing the sensitive feature (Adult: about 0.846 in both cases; German: 0.760 vs. 0.767). Both fairness metrics move closer to zero. On Adult, statistical parity goes from -0.184 to -0.175 and equal opportunity from -0.101 to -0.071. On German, statistical parity goes from 0.097 to 0.083 and equal opportunity from 0.090 to 0.079.

Explicitly removing the sensitive feature helps a little, because the model can no longer use sex or age directly. However, the improvement is small and most of the disparity remains. Other features can act as proxies that correlate with the sensitive attribute (for example, `relationship` in Adult takes values such as Husband and Wife), so the model can still pick up the same signal indirectly.
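
One rough way to look for such proxies is to correlate each remaining feature with the removed sensitive attribute. The sketch below does this on a small made-up table, not on the Adult data; the helper name `proxy_correlations` and the toy values are assumptions for illustration, and a low linear correlation does not rule out a proxy that acts through feature combinations.

```python
import pandas as pd

def proxy_correlations(df_x, sens_col, categ_cols):
    # Absolute Pearson correlation between the sensitive column and every
    # other feature, with categorical features one-hot encoded first
    sens = df_x[sens_col]
    others = pd.get_dummies(df_x.drop(columns=[sens_col]),
                            columns=categ_cols, dtype=float)
    return others.corrwith(sens).abs().sort_values(ascending=False)

# Made-up rows in the style of the Adult features (sex: 0 = Male, 1 = Female)
toy = pd.DataFrame({
    'relationship': ['Husband', 'Wife', 'Husband', 'Wife',
                     'Not-in-family', 'Not-in-family'],
    'hours': [40, 38, 45, 20, 40, 40],
    'sex': [0, 1, 0, 1, 0, 1],
})

corr = proxy_correlations(toy, 'sex', ['relationship'])
print(corr)
```

In this toy table the `relationship_Husband` and `relationship_Wife` columns rank highest, which is the kind of signal that lets a model recover the sensitive attribute after the column itself is dropped.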

### 4.2 Augmenting the training set

#### See the example in Figure 1 of https://dl.acm.org/doi/pdf/10.1145/3375627.3375865

In [69]:
df_train_x_adult

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,0,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,0,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,0,0,0,40,United-States
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,0,0,0,40,United-States
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,1,0,0,40,Cuba
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30157,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,1,0,0,38,United-States
30158,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,0,0,0,40,United-States
30159,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,1,0,0,40,United-States
30160,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,0,0,0,20,United-States


In [82]:
# Adult
# create a synthetic training set by duplicating df_train_x_adult and df_train_y_adult
# after duplicating flip sex in the synthetic set
# You shouldn't do it in-place. In other words, do not modify df_train_x_adult or df_train_y_adult
# TODO. 8pts. The starter code below just indicates what you need to output in your code.
df_train_x_syn_adult = df_train_x_adult.copy()
df_train_x_syn_adult['sex'] = df_train_x_syn_adult['sex'].apply(lambda x: 1 if x==0 else 0)

df_train_y_syn_adult = df_train_y_adult.copy()
# df_train_y_syn_adult['income'] = df_train_y_syn_adult['income'].apply(lambda x: 1 if x==0 else 0)


# augment the original training set by the synthetic set. In other words, concatenate them
df_train_x_aug_adult = pd.concat((df_train_x_adult, df_train_x_syn_adult))
df_train_y_aug_adult = pd.concat((df_train_y_adult, df_train_y_syn_adult))

print(df_train_x_aug_adult.shape, df_train_y_aug_adult.shape)


train_x_adult, test_x_adult = process_dfs(df_train_x_aug_adult, df_test_x_adult, 
                                                   ['workclass', 'education','marital-status',
                                                    'occupation','relationship','race',
                                                    'native-country'])
train_y_adult = df_train_y_aug_adult.values
print(train_x_adult.shape, test_x_adult.shape, train_y_adult.shape)



# German
# create a synthetic training set by duplicating df_train_x_german and df_train_y_german
# after duplicating flip age in the synthetic set.
# You shouldn't do it in-place. In other words, do not modify df_train_x_german or df_train_y_german
# TODO. 8pts. The starter code below just indicates what you need to output in your code.
df_train_x_syn_german = df_train_x_german.copy()
df_train_x_syn_german['age'] = df_train_x_syn_german['age'].apply(lambda x: 1 if x==0 else 0)

df_train_y_syn_german = df_train_y_german.copy()

# augment the original training set by the synthetic set. In other words, concatenate them
df_train_x_aug_german = pd.concat((df_train_x_german, df_train_x_syn_german))
df_train_y_aug_german = pd.concat((df_train_y_german, df_train_y_syn_german))

train_y_german = df_train_y_aug_german.values

print(df_train_x_aug_german.shape, df_train_y_aug_german.shape, train_y_german.shape)


train_x_german, test_x_german = process_dfs(df_train_x_aug_german, df_test_x_german,
                                                     ['checking_account', 'credit_history', 
                                                      'purpose', 'savings_account', 'present_employment_since', 
                                                      'personal_status_sex', 'other_debtors',
                                                     'property', 'other_installment_plans',
                                                     'housing', 'job', 'telephone', 'foreign_worker'])
print(train_x_german.shape, test_x_german.shape)

(60324, 14) (60324,)
(60324, 103) (15060, 103) (60324,)
(1400, 20) (1400,) (1400,)
(1400, 61) (300, 61)
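
The duplicate-and-flip step is the same for both datasets, so it could be factored into one helper. The sketch below is one way to do that; the name `augment_with_flipped` is hypothetical, it assumes the sensitive column is coded as 0/1, and it uses `ignore_index=True` to avoid the repeated index labels that a plain `pd.concat` produces.

```python
import pandas as pd

def augment_with_flipped(df_x, y, sens_col):
    # Copy the features so the caller's DataFrame is left untouched
    syn = df_x.copy()
    # Flip the binary sensitive attribute: 0 -> 1 and 1 -> 0
    syn[sens_col] = 1 - syn[sens_col]
    # Stack original and synthetic rows; labels are duplicated unchanged
    x_aug = pd.concat([df_x, syn], ignore_index=True)
    y_aug = pd.concat([y, y], ignore_index=True)
    return x_aug, y_aug

# Made-up example rows
toy_x = pd.DataFrame({'age': [39, 50, 28], 'sex': [0, 0, 1]})
toy_y = pd.Series([0, 1, 0], name='income')

x_aug, y_aug = augment_with_flipped(toy_x, toy_y, 'sex')
print(x_aug)
print(y_aug.tolist())
```

After augmentation every row appears once with each value of the sensitive attribute, so the attribute's mean in the augmented set is exactly 0.5.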


In [83]:
# train a classifier to predict the outcome y from features x on the augmented training data
# training: train_x --> train_y; test: test_x --> preds
# logistic regression model is recommended
# using sklearn is allowed
# Just use the code in 3.2 again


# Adult 4 pts

# initialize the model
# TODO.
model5 = LogisticRegression()

# train/fit the model with train_x_adult and train_y_adult
# TODO.

model5.fit(train_x_adult,train_y_adult)

# predict the outcome from test_x_adult
# TODO. The starter code below just indicates what you need to output in your code.
preds = model5.predict(test_x_adult)

# report acc and two fairness metrics
from sklearn.metrics import accuracy_score
acc = accuracy_score(test_y_adult, preds)
stat_p = stat_parity(preds, test_sens_adult)
eq_op = eq_oppo(preds, test_sens_adult, test_y_adult)
print(acc, stat_p, eq_op)



# German 4 pts

# initialize the model
# TODO.

model6 = LogisticRegression()

# train/fit the model with train_x_german and train_y_german
# TODO.

model6.fit(train_x_german,train_y_german)

# predict the outcome from test_x_german
# TODO. The starter code below just indicates what you need to output in your code.
preds = model6.predict(test_x_german)

# report acc and two fairness metrics
from sklearn.metrics import accuracy_score
acc = accuracy_score(test_y_german, preds)
stat_p = stat_parity(preds, test_sens_german)
eq_op = eq_oppo(preds, test_sens_german, test_y_german)
print(acc, stat_p, eq_op)

0.847011952191235 -0.1749751881616645 -0.07057717386275164
0.77 0.07643057222889149 0.07932692307692313


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### According to the results, how are the accuracy, stat parity and eq oppo different from the original model? Does augmenting the dataset with synthetic data help in mitigating bias? Why or why not?
### <span style="color:red">Please type your response here.</span> 4 points
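Accuracy improves slightly after augmenting the data (Adult: 0.8458 to 0.8470; German: 0.76 to 0.77). Statistical parity and equal opportunity both move closer to zero (Adult: -0.184 to -0.175 and -0.101 to -0.071; German: 0.097 to 0.076 and 0.090 to 0.079). Augmenting the dataset with synthetic data therefore does help mitigate bias, although only modestly. The reason is that every training instance now appears twice, once with each value of the sensitive feature and with the same label, so the sensitive feature is balanced across the outcome and carries no information about it in the training set. Nevertheless, if the data itself is still biased, for example through the labels or through other features that correlate with the sensitive attribute, augmenting the dataset will not remove that bias and may even exacerbate it.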