# Part 1

### 1.1

We chose a dataset called ["salary.csv"](https://www.kaggle.com/datasets/ayessa/salary-prediction-classification), whose original purpose is to build a predictive algorithm that determines whether a person earns more than 50,000 USD per year. It is reasonable to suspect that the proportion of high earners in this dataset differs by sex, which is why we chose it to study its possible bias.

First, we import the packages needed to use AIF360 and the dataset.



In [1]:
import sys
sys.path.insert(1, "../")

from aif360.metrics import BinaryLabelDatasetMetric
from aif360.algorithms.preprocessing import Reweighing
from IPython.display import Markdown, display
from aif360.datasets import StandardDataset

import pandas as pd
import numpy as np

np.random.seed(0)

### 1.2

We preprocess the dataset so that it can be used with the AIF360 functions.

In [2]:
salary = pd.read_csv('salary.csv')

salary['sex'] = salary['sex'].apply(lambda x: 0 if x==' Female' else 1)
salary['salary'] = salary['salary'].apply(lambda x: 1 if x==' >50K' else 0)
salary_aif = StandardDataset(salary, label_name='salary', protected_attribute_names=['sex'],
                              privileged_classes=[[1]], favorable_classes=[1],
                                features_to_drop=['workclass', 'fnlwgt', 'education', 'marital-status',
                                                  'occupation', 'relationship', 'race', 'native-country'])


salary_aif_train, salary_aif_test = salary_aif.split([0.7], shuffle=True)
# salary_aif_test, salary_val_aif = salary_aif_test.split([0.7], shuffle=True)
privileged_groups = [{'sex': 1}]
unprivileged_groups = [{'sex': 0}]

In [3]:
print(salary_aif.features.shape, salary_aif_train.features.shape, salary_aif_test.features.shape)


(32561, 6) (22792, 6) (9769, 6)


The dataset can now be used with the AIF360 functions, since it was processed correctly.

### 1.3

We selected the "Male" group as privileged and the "Female" group as unprivileged. We made this choice because a large number of articles support this assumption. [For example](https://www.payscale.com/gender-lifetime-earnings-gap), this article states that men earn more than women, but not *that* much more when they work the same job.

### 1.4

We compute two metrics. The first is the mean difference, which subtracts the rate of favorable outcomes of the privileged group from that of the unprivileged group. The second is "Disparate Impact", the ratio of the favorable-outcome rate of the unprivileged group to that of the privileged group.
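For reference, the two metrics can be written as below. The notation is ours, introduced only for illustration: $Y = 1$ is the favorable label (salary above 50K) and $D$ is the protected attribute (sex).

```latex
\begin{align*}
\text{mean difference} &= P(Y = 1 \mid D = \text{unprivileged}) - P(Y = 1 \mid D = \text{privileged}) \\
\text{disparate impact} &= \frac{P(Y = 1 \mid D = \text{unprivileged})}{P(Y = 1 \mid D = \text{privileged})}
\end{align*}
```

With no disparity between the groups, the mean difference is 0 and the disparate impact is 1.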



In [4]:
mean_train = BinaryLabelDatasetMetric(salary_aif_train,
                                             unprivileged_groups=unprivileged_groups,
                                             privileged_groups=privileged_groups)
print("Mean difference = %f" % mean_train.mean_difference())
print("Disparate impact = %f" % mean_train.disparate_impact())

Mean difference = -0.195396
Disparate impact = 0.363245


Since the mean difference (mean_difference) is negative, the unprivileged group receives fewer favorable outcomes, so a bias does exist. The disparate impact is also quite low, meaning that the privileged group earns more than 50K at a proportionally higher rate than the unprivileged group. Ideally, the disparate impact would be close to 1.
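As a sanity check, the favorable-outcome rate of each group can be recovered from the two values printed above with a little algebra. The sketch below only rearranges those two numbers; the variable names are illustrative and not part of the notebook.

```python
# Values printed by the cell above (training split).
mean_difference = -0.195396   # p_unpriv - p_priv
disparate_impact = 0.363245   # p_unpriv / p_priv

# Substituting p_unpriv = disparate_impact * p_priv into the difference gives
# p_priv * (disparate_impact - 1) = mean_difference.
p_priv = mean_difference / (disparate_impact - 1)
p_unpriv = disparate_impact * p_priv

print(f"Favorable rate, privileged (sex = 1):   {p_priv:.3f}")
print(f"Favorable rate, unprivileged (sex = 0): {p_unpriv:.3f}")
```

This gives roughly 0.307 for the privileged group and 0.111 for the unprivileged group, i.e. about 31% of men versus 11% of women in the training split are labeled as earning more than 50K.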

### 1.5

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

In [6]:
salary_aif_test, salary_val_aif = salary_aif_test.split([0.7], shuffle=True)

train_df, _ = salary_aif_train.convert_to_dataframe()
val_df, _ = salary_val_aif.convert_to_dataframe()
test_df, _ = salary_aif_test.convert_to_dataframe()

print(f'Train set: {train_df.shape}')
print(f'Val set: {val_df.shape}')
print(f'Test set: {test_df.shape}')

Train set: (22792, 7)
Val set: (2931, 7)
Test set: (6838, 7)
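The shapes printed so far are consistent with each cut falling at the row count times the fraction, truncated to an integer. The helper below, `split_sizes`, is a hypothetical illustration of that arithmetic and not AIF360's own code.

```python
def split_sizes(n_rows, fraction):
    """Sizes of the two parts when the cut falls at int(fraction * n_rows)."""
    first = int(fraction * n_rows)
    return first, n_rows - first

# First split (section 1.2): 70% for training, the rest held out.
n_train, n_holdout = split_sizes(32561, 0.7)

# Second split (section 1.5): 70% of the held-out rows for test, the rest for validation.
n_test, n_val = split_sizes(n_holdout, 0.7)

print(n_train, n_holdout)  # 22792 9769
print(n_test, n_val)       # 6838 2931
```

The resulting proportions are therefore about 70% train, 21% test, and 9% validation.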


In [7]:
x_train = train_df.drop('salary', axis=1)
y_train = train_df.salary

x_val = val_df.drop('salary', axis=1)
y_val = val_df.salary

In [8]:
logistic = LogisticRegression(C=0.5, penalty='l1', solver='liblinear')
logistic.fit(x_train, y_train, sample_weight=None)

linear = LinearRegression()
linear.fit(x_train, y_train)

LinearRegression()

In [9]:
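def evaluate(model, X, y_true):
    # Score each row: positive-class probability for logistic regression,
    # raw regression output for linear regression.
    if isinstance(model, LogisticRegression):
        y_pred = model.predict_proba(X)[:, 1]
    elif isinstance(model, LinearRegression):
        y_pred = model.predict(X)
    else:
        raise TypeError(f'Unsupported model type: {type(model).__name__}')

    # Accuracy thresholds the score at 0.5; AUC uses the raw score.
    accuracy = accuracy_score(y_true, y_pred >= 0.5)
    auc = roc_auc_score(y_true, y_pred)
    return accuracy, auc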

In [10]:
print('Logistic regression')
accuracy, auc = evaluate(logistic, x_val, y_val)
print(f'Accuracy: {accuracy}')
print(f'AUC: {auc}')

print('\nLinear regression')
accuracy, auc = evaluate(linear, x_val, y_val)
print(f'Accuracy: {accuracy}')
print(f'AUC: {auc}')

Logistic regression
Accuracy: 0.8225861480723302
AUC: 0.8417734073349594

Linear regression
Accuracy: 0.8106448311156602
AUC: 0.8327195990400214


In [11]:
logistic_pred_df = train_df.copy()
logistic_pred_df['predict'] = logistic.predict(x_train)

linear_pred_df = train_df.copy()
linear_pred_df['predict'] = linear.predict(x_train)

In [12]:
# false positives: true label 0, predicted label 1
# (a single combined mask avoids the chained-indexing reindexing warning)
logistic_pred_df[(logistic_pred_df['salary'] == 0) & (logistic_pred_df['predict'] == 1)]


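| | age | education-num | sex | capital-gain | capital-loss | hours-per-week | salary | predict |
|---|---|---|---|---|---|---|---|---|
| 12319 | 29.0 | 13.0 | 1.0 | 0.0 | 0.0 | 75.0 | 0.0 | 1.0 |
| 6964 | 29.0 | 16.0 | 1.0 | 0.0 | 0.0 | 60.0 | 0.0 | 1.0 |
| 21840 | 56.0 | 16.0 | 1.0 | 0.0 | 0.0 | 50.0 | 0.0 | 1.0 |
| 19520 | 63.0 | 13.0 | 1.0 | 0.0 | 0.0 | 40.0 | 0.0 | 1.0 |
| 26360 | 51.0 | 13.0 | 1.0 | 0.0 | 0.0 | 50.0 | 0.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 11321 | 75.0 | 10.0 | 1.0 | 0.0 | 1735.0 | 40.0 | 0.0 | 1.0 |
| 10328 | 70.0 | 14.0 | 1.0 | 0.0 | 0.0 | 8.0 | 0.0 | 1.0 |
| 15323 | 29.0 | 14.0 | 1.0 | 0.0 | 0.0 | 60.0 | 0.0 | 1.0 |
| 32340 | 75.0 | 14.0 | 1.0 | 0.0 | 0.0 | 45.0 | 0.0 | 1.0 |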


In [13]:
logistic_pred_df

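| | age | education-num | sex | capital-gain | capital-loss | hours-per-week | salary | predict |
|---|---|---|---|---|---|---|---|---|
| 22278 | 27.0 | 10.0 | 0.0 | 0.0 | 0.0 | 44.0 | 0.0 | 0.0 |
| 8950 | 27.0 | 13.0 | 0.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 |
| 7838 | 25.0 | 12.0 | 1.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 |
| 16505 | 46.0 | 3.0 | 1.0 | 0.0 | 1902.0 | 40.0 | 0.0 | 0.0 |
| 19140 | 45.0 | 7.0 | 1.0 | 0.0 | 2824.0 | 76.0 | 1.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19631 | 34.0 | 13.0 | 1.0 | 0.0 | 0.0 | 40.0 | 1.0 | 0.0 |
| 29694 | 56.0 | 13.0 | 0.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 |
| 16546 | 34.0 | 11.0 | 1.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 |
| 17973 | 25.0 | 9.0 | 1.0 | 0.0 | 0.0 | 40.0 | 0.0 | 0.0 |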


In [18]:
# Both datasets use label_name='salary', i.e. the true labels rather than the 'predict' column.
logistic_aif360 = StandardDataset(logistic_pred_df, label_name='salary', protected_attribute_names=['sex'],
                                  privileged_classes=[[1]], favorable_classes=[1])

linear_aif360 = StandardDataset(linear_pred_df, label_name='salary', protected_attribute_names=['sex'],
                                privileged_classes=[[1]], favorable_classes=[1])

logistic_metrics = BinaryLabelDatasetMetric(logistic_aif360,
                                            unprivileged_groups=unprivileged_groups,
                                            privileged_groups=privileged_groups)

linear_metrics = BinaryLabelDatasetMetric(linear_aif360,
                                          unprivileged_groups=unprivileged_groups,
                                          privileged_groups=privileged_groups)

print("Mean difference, logistic regression = %f" % logistic_metrics.mean_difference())
print("Disparate impact, logistic regression = %f" % logistic_metrics.disparate_impact())

print("Mean difference, linear regression = %f" % linear_metrics.mean_difference())
print("Disparate impact, linear regression = %f" % linear_metrics.disparate_impact())

Mean difference, logistic regression = -0.195396
Disparate impact, logistic regression = 0.363245
Mean difference, linear regression = -0.195396
Disparate impact, linear regression = 0.363245
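Both models reproduce the values from section 1.4 because the datasets above are built with `label_name='salary'`, so the metrics describe the true labels and not the predictions. A minimal sketch of how the same two metrics could be computed on predicted labels follows. It uses plain Python lists and toy values; `prediction_fairness`, the 0.5 threshold, and the sample data are assumptions for illustration and are not part of the notebook.

```python
def prediction_fairness(scores, groups, threshold=0.5):
    """Mean difference and disparate impact of thresholded predictions.

    scores: model outputs (probabilities or regression values).
    groups: 1 for the privileged group, 0 for the unprivileged group.
    Assumes both groups are non-empty and the privileged rate is not zero.
    """
    def favorable_rate(group):
        picked = [s >= threshold for s, g in zip(scores, groups) if g == group]
        return sum(picked) / len(picked)

    rate_unpriv = favorable_rate(0)
    rate_priv = favorable_rate(1)
    return rate_unpriv - rate_priv, rate_unpriv / rate_priv

# Toy values, not taken from salary.csv.
toy_scores = [0.9, 0.7, 0.4, 0.2, 0.6, 0.3, 0.1, 0.2]
toy_groups = [1, 1, 1, 1, 0, 0, 0, 0]

mean_diff, disp_impact = prediction_fairness(toy_scores, toy_groups)
print(f"Mean difference = {mean_diff:.2f}")    # -0.25
print(f"Disparate impact = {disp_impact:.2f}")  # 0.50
```

The threshold matters mainly for the linear regression, whose outputs are continuous values rather than class labels.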


### 1.6

In [20]:
# Combined boolean masks avoid chained indexing and its reindexing warnings.
df = logistic_pred_df

# false positives, sex == 0
fp_0 = df[(df['salary'] == 0) & (df['predict'] == 1) & (df['sex'] == 0)].shape[0]

# false positives, sex == 1
fp_1 = df[(df['salary'] == 0) & (df['predict'] == 1) & (df['sex'] == 1)].shape[0]

# false negatives, sex == 0
fn_0 = df[(df['salary'] == 1) & (df['predict'] == 0) & (df['sex'] == 0)].shape[0]

# false negatives, sex == 1
fn_1 = df[(df['salary'] == 1) & (df['predict'] == 0) & (df['sex'] == 1)].shape[0]

# true positives, sex == 0
tp_0 = df[(df['salary'] == 1) & (df['predict'] == 1) & (df['sex'] == 0)].shape[0]

# true positives, sex == 1
tp_1 = df[(df['salary'] == 1) & (df['predict'] == 1) & (df['sex'] == 1)].shape[0]

# group sizes
n_0 = df[df['sex'] == 0].shape[0]
n_1 = df[df['sex'] == 1].shape[0]

# share of the privileged group that are true positives
tp_1/n_1

0.1437082624067042
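Counts like these are easier to compare across groups once they are normalized. The sketch below computes a true positive rate and a false positive rate for one group from plain lists. The function name `group_rates` and the toy data are hypothetical, and the sketch assumes each group contains at least one actual positive and one actual negative.

```python
def group_rates(y_true, y_pred, groups, group):
    """True positive rate and false positive rate for one group."""
    rows = [(t, p) for t, p, g in zip(y_true, y_pred, groups) if g == group]
    tp = sum(1 for t, p in rows if t == 1 and p == 1)
    fn = sum(1 for t, p in rows if t == 1 and p == 0)
    fp = sum(1 for t, p in rows if t == 0 and p == 1)
    tn = sum(1 for t, p in rows if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

# Toy labels and predictions, not taken from salary.csv.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = [1, 1, 1, 1, 0, 0, 0, 0]

for g in (0, 1):
    tpr, fpr = group_rates(y_true, y_pred, groups, g)
    print(f"sex = {g}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

In this toy data the privileged group has the higher rate on both measures, which is the kind of gap the per-group counts are meant to expose.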

Since the mean difference (mean_difference) is negative, the unprivileged group receives fewer favorable outcomes, so a bias does exist.