# The Effect of Architectures analysis:
(_This message is copied from the Slack conversation_)

- We want to know how the architecture itself impacts fairness measures.
- At a minimum, we will report the architectures which converged and:
-- have the best fairness measures
-- are Pareto dominant
- Reporting these raw numbers and the corresponding architectures will provide some baseline information about top performers, but it doesn’t answer any more in-depth questions.
- To look at the architectures in more depth, we can measure how strong the relationship is between the metrics from randomly initialized models and the metrics from the trained models. We can do this with plain correlations or with regressions like: accuracy_at_epoch_20 ~ random_accuracy + feature_size + number_model_params + n_conv + ….
- We can also look at Pareto curves for each epoch (20, 40, 60, 80, 100) and see whether the models that Pareto dominate at epoch 20 also dominate at epoch 100. This could give us a sense of the persistence of architectures.
- For the above analysis, we can either look at all the hyperparameter (hp) experiments for all the models, or only at the hps which give maximal performance or Pareto dominate for each model.
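
As a minimal sketch of the correlation and regression idea in the list above, the snippet below uses synthetic data and plain NumPy least squares (the notebook itself uses statsmodels). The names `random_accuracy`, `feature_size` and `accuracy_at_epoch_20` follow the formula above; the data-generating coefficients are made up for illustration and are not results.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for per-model measurements (assumed, not real results)
random_accuracy = rng.uniform(0.0, 0.05, size=n)
feature_size = rng.integers(128, 2048, size=n).astype(float)

# Hypothetical ground truth: trained accuracy is linear in both predictors
accuracy_at_epoch_20 = (
    0.5
    + 4.0 * random_accuracy
    + 1e-4 * feature_size
    + rng.normal(0.0, 0.01, size=n)
)

# Plain correlation between the random-init metric and the trained metric
corr = np.corrcoef(random_accuracy, accuracy_at_epoch_20)[0, 1]

# OLS fit of: accuracy_at_epoch_20 ~ 1 + random_accuracy + feature_size
X = np.column_stack([np.ones(n), random_accuracy, feature_size])
coef, *_ = np.linalg.lstsq(X, accuracy_at_epoch_20, rcond=None)
intercept, b_random, b_feature = coef

print(f"corr={corr:.3f}, b_random={b_random:.3f}, b_feature={b_feature:.2e}")
```

The correlation only describes the pairwise relationship, while the regression coefficient on `random_accuracy` is the association that remains after the other predictors are held fixed.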


In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'
import numpy as np
import os
from analysis import *
import glob
import plotly.express as px
import statsmodels.api as sm
from stargazer.stargazer import Stargazer

final_models = get_finished_models_Phase1B()
metadata = pd.read_csv('val_identities_gender-expression_seed_222.csv')

In [2]:
default_params = [x for x in glob.glob('../configs/**/*.yaml') if any(m in x for m in final_models)]
hp_same_lr = [x for x in glob.glob('../configs_multi/**/*.yaml') if any(m in x for m in final_models)]
hp_unified_lr = [x for x in glob.glob('../configs_unified_lr/**/*.yaml') if any(m in x for m in final_models)]
    
rank_files = glob.glob('timm_explore_few_epochs/**/*_rank_by_id_val.csv') + glob.glob('Phase1B/**/*_rank_by_id_val.csv')

epochs = [19,39,59,79,99]
epoch_columns = ['epoch_'+str(i) for i in epochs]

In [3]:
acc_df, acc_disp_df, rank_df = analyze_rank_files(rank_files, metadata, epochs=epoch_columns)
_, acc_disp_ratio_df, rank_ratio_df = analyze_rank_files(rank_files, metadata, ratio=True, epochs=epoch_columns)
err_df, error_ratio_df, _ = analyze_rank_files(rank_files, metadata, ratio=True, error=True, epochs=epoch_columns)

acc_disp_df = merge(acc_df, acc_disp_df)
rank_df = merge(acc_df, rank_df)
acc_disp_ratio_df = merge(acc_df, acc_disp_ratio_df)
rank_ratio_df = merge(acc_df, rank_ratio_df)
error_ratio_df = merge(err_df, error_ratio_df).rename(columns={'Accuracy':'Error'})

In [4]:
# Models that did not converge: any model with a row whose Metric is below 0.25
non_converged_models = list(set(acc_df[acc_df['Metric'] < 0.25]['index']))

In [5]:
[acc_df, acc_disp_df, rank_df, acc_disp_ratio_df, rank_ratio_df, err_df, error_ratio_df] = drop_models([acc_df, acc_disp_df, rank_df, acc_disp_ratio_df, rank_ratio_df, err_df, error_ratio_df], non_converged_models)

# The architectures which converged and have the best fairness measures

## gluon_xception65_MagFace_AdamW
Across all the fairness metrics, this model had the best performance. Note, though, that the model with the best overall fairness metric isn't necessarily the model that a policy maker would choose. For example, this model has overall accuracy below 70%.

These are the models which converged and had the best fairness metric at some epoch of the training procedure (restricted to epochs 20, 40, 60, 80, 100).

#### Difference of Accuracies:
This metric is also called Statistical Parity
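
The helper `analyze_rank_files` is not shown here, so the exact definitions of the disparity columns are an assumption. One reading consistent with the metric names is that each disparity compares a per-group statistic between two groups, either as an absolute difference or as the distance of a ratio from 1. The function names and the group accuracies below are hypothetical.

```python
def difference_disparity(a: float, b: float) -> float:
    """Absolute gap between two group-level statistics (0 means parity)."""
    return abs(a - b)


def ratio_disparity(a: float, b: float) -> float:
    """Distance of the ratio a / b from 1 (0 means parity)."""
    return abs(1.0 - a / b)


# Hypothetical per-group accuracies, not taken from the experiments
acc_group_a, acc_group_b = 0.90, 0.88

diff_of_acc = difference_disparity(acc_group_a, acc_group_b)
ratio_of_acc = ratio_disparity(acc_group_a, acc_group_b)
# Ratio of errors applies the same ratio to the error rates (1 - accuracy)
ratio_of_err = ratio_disparity(1 - acc_group_a, 1 - acc_group_b)

print(round(diff_of_acc, 4), round(ratio_of_acc, 4), round(ratio_of_err, 4))
```

Under this reading, a small accuracy gap turns into a much larger ratio of errors when both error rates are small, which is why a high-accuracy model can look fair on one metric and unfair on another.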

In [6]:
df = acc_disp_df
df[df.Disparity == df.Disparity.min()]

Unnamed: 0,index,epoch,Accuracy,Disparity
751,gluon_xception65_MagFace_AdamW_rank_by_id_val,79,0.657805,0.008905


#### Ratio of Accuracies: 

In [7]:
df = acc_disp_ratio_df
df[df.Disparity == df.Disparity.min()]

Unnamed: 0,index,epoch,Accuracy,Disparity
751,gluon_xception65_MagFace_AdamW_rank_by_id_val,79,0.657805,0.01363


#### Difference of Ranks: 

In [8]:
df = rank_df
df[df.Disparity == df.Disparity.min()]

Unnamed: 0,index,epoch,Accuracy,Disparity
545,gluon_xception65_MagFace_AdamW_rank_by_id_val,59,0.658525,0.106993


#### Ratio of Ranks: 

In [9]:
df = rank_ratio_df
df[df.Disparity == df.Disparity.min()]

Unnamed: 0,index,epoch,Accuracy,Disparity
545,gluon_xception65_MagFace_AdamW_rank_by_id_val,59,0.658525,0.003348


#### Ratio of Errors: 

In [10]:
df = error_ratio_df
df[df.Disparity == df.Disparity.min()]

Unnamed: 0,index,epoch,Error,Disparity
751,gluon_xception65_MagFace_AdamW_rank_by_id_val,79,0.342195,0.025689


# The architectures which converged and are Pareto efficient

These are the models which converged and were Pareto efficient for the fairness metrics at some epoch of the training procedure (restricted to epochs 20, 40, 60, 80, 100).

Consistently, rexnet is Pareto efficient for every metric, and has the highest accuracy. Additionally, dpn107 appears as another high-accuracy, Pareto-efficient model for the two rank-based metrics.
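
`whatIsPareto` comes from the local `analysis` module and is not shown, so the following is only a sketch of the underlying idea, not its implementation. A point is Pareto efficient if no other point is at least as good on both axes and strictly better on one; here higher accuracy and lower disparity are better. The function name `pareto_mask` is hypothetical. The first three points reuse values from the Difference of Accuracies results, and the last point is an invented dominated point added for contrast.

```python
import numpy as np


def pareto_mask(accuracy, disparity):
    """Boolean mask of points that no other point dominates.

    Higher accuracy is better, lower disparity is better.
    """
    acc = np.asarray(accuracy, dtype=float)
    disp = np.asarray(disparity, dtype=float)
    mask = np.ones(acc.shape[0], dtype=bool)
    for i in range(acc.shape[0]):
        # Another point dominates i if it is no worse on both axes
        # and strictly better on at least one of them.
        no_worse = (acc >= acc[i]) & (disp <= disp[i])
        strictly_better = (acc > acc[i]) | (disp < disp[i])
        mask[i] = not np.any(no_worse & strictly_better)
    return mask


accuracy = [0.955474, 0.954623, 0.657805, 0.900000]   # last value is invented
disparity = [0.023311, 0.016632, 0.008905, 0.030000]  # last value is invented
mask = pareto_mask(accuracy, disparity)
print(mask)
```

The invented point is dropped because the first point has both higher accuracy and lower disparity; the other three each trade accuracy for disparity and so all stay on the front.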

#### Difference of Accuracies: 
This metric is also called Statistical Parity

In [11]:
df = acc_disp_df
ind = whatIsPareto(df[['Accuracy','Disparity']], True, False).astype(bool)
df[ind].dropna().sort_values('Accuracy', ascending=False)

Unnamed: 0,index,epoch,Accuracy,Disparity
723,rexnet_200_CosFace_SGD_rank_by_id_val,79,0.955474,0.023311
311,rexnet_200_CosFace_SGD_rank_by_id_val,39,0.954885,0.019251
517,rexnet_200_CosFace_SGD_rank_by_id_val,59,0.954623,0.016632
957,gluon_xception65_MagFace_AdamW_rank_by_id_val,99,0.658133,0.012441
751,gluon_xception65_MagFace_AdamW_rank_by_id_val,79,0.657805,0.008905


#### Ratio of Accuracies: 

In [12]:
df = acc_disp_ratio_df
ind = whatIsPareto(df[['Accuracy','Disparity']], True, False).astype(bool)
df[ind].dropna().sort_values('Accuracy', ascending=False)

Unnamed: 0,index,epoch,Accuracy,Disparity
723,rexnet_200_CosFace_SGD_rank_by_id_val,79,0.955474,0.024698
311,rexnet_200_CosFace_SGD_rank_by_id_val,39,0.954885,0.020366
517,rexnet_200_CosFace_SGD_rank_by_id_val,59,0.954623,0.017575
751,gluon_xception65_MagFace_AdamW_rank_by_id_val,79,0.657805,0.01363


#### Difference of Ranks: 

In [13]:
df = rank_df
ind = whatIsPareto(df[['Accuracy','Disparity']], True, False).astype(bool)
df[ind].dropna().sort_values('Accuracy', ascending=False)

Unnamed: 0,index,epoch,Accuracy,Disparity
723,rexnet_200_CosFace_SGD_rank_by_id_val,79,0.955474,2.183735
977,tnt_s_patch16_224_CosFace_AdamW_rank_by_id_val,99,0.948402,1.94945
186,rexnet_200_MagFace_SGD_rank_by_id_val,19,0.947944,1.451414
16,dpn107_CosFace_AdamW_rank_by_id_val,19,0.933473,1.042823
65,dla102x2_CosFace_sgd_rank_by_id_val,19,0.867994,0.566265
389,gluon_inception_v3_CosFace_SGD_rank_by_id_val,39,0.826152,0.47957
899,cspdarknet53_CosFace_AdamW_rank_by_id_val,99,0.763685,0.21451
545,gluon_xception65_MagFace_AdamW_rank_by_id_val,59,0.658525,0.106993


#### Ratio of Ranks: 

In [14]:
df = rank_ratio_df
ind = whatIsPareto(df[['Accuracy','Disparity']], True, False).astype(bool)
df[ind].dropna().sort_values('Accuracy', ascending=False)

Unnamed: 0,index,epoch,Accuracy,Disparity
723,rexnet_200_CosFace_SGD_rank_by_id_val,79,0.955474,0.406113
929,rexnet_200_CosFace_SGD_rank_by_id_val,99,0.954034,0.403697
186,rexnet_200_MagFace_SGD_rank_by_id_val,19,0.947944,0.389561
16,dpn107_CosFace_AdamW_rank_by_id_val,19,0.933473,0.133601
65,dla102x2_CosFace_sgd_rank_by_id_val,19,0.867994,0.061235
899,cspdarknet53_CosFace_AdamW_rank_by_id_val,99,0.763685,0.012548
545,gluon_xception65_MagFace_AdamW_rank_by_id_val,59,0.658525,0.003348


#### Ratio of Errors: 

In [15]:
df = error_ratio_df
ind = whatIsPareto(df[['Error','Disparity']], False, False).astype(bool)
df[ind].dropna().sort_values('Error')

Unnamed: 0,index,epoch,Error,Disparity
723,rexnet_200_CosFace_SGD_rank_by_id_val,79,0.044526,0.414918
311,rexnet_200_CosFace_SGD_rank_by_id_val,39,0.045115,0.351675
517,rexnet_200_CosFace_SGD_rank_by_id_val,59,0.045377,0.309756
387,ese_vovnet39b_CosFace_SGD_rank_by_id_val,39,0.102541,0.265781
296,resnetrs101_CosFace_SGD_rank_by_id_val,39,0.104178,0.236142
62,hrnet_w64_CosFace_sgd_rank_by_id_val,19,0.122381,0.174805
969,vgg19_bn_ArcFace_SGD_rank_by_id_val,99,0.228064,0.16491
339,gluon_xception65_MagFace_AdamW_rank_by_id_val,39,0.332504,0.049174
957,gluon_xception65_MagFace_AdamW_rank_by_id_val,99,0.341867,0.035741
751,gluon_xception65_MagFace_AdamW_rank_by_id_val,79,0.342195,0.025689


# Pareto Optimality across epochs

Using the Difference of Accuracies metric, the following experiment is Pareto optimal across all models at *Epoch 20*
- rexnet_200_MagFace_SGD

The following experiments are Pareto optimal across all models at *Epoch 40*
- rexnet_200_CosFace_SGD
- gluon_xception65_MagFace_AdamW

The following experiment is Pareto optimal across all models at *Epoch 60*
- rexnet_200_CosFace_SGD

The following experiments are Pareto optimal across all models at *Epoch 80*
- rexnet_200_CosFace_SGD
- gluon_xception65_MagFace_AdamW

The following experiments are Pareto optimal across all models at *Epoch 100*
- rexnet_200_CosFace_SGD
- gluon_xception65_MagFace_AdamW


In [16]:
# Collect the per-epoch Pareto fronts, then concatenate once
# (DataFrame.append was removed in pandas 2.0)
pareto_fronts = []
for e in epochs:
    df = acc_disp_df[acc_disp_df['epoch'] == e]
    ind = whatIsPareto(df[['Accuracy','Disparity']], True, False).astype(bool)
    out = df[ind].dropna().sort_values('Accuracy', ascending=False)
    pareto_fronts.append(out)
foo = pd.concat(pareto_fronts)

In [17]:
foo

Unnamed: 0,index,epoch,Accuracy,Disparity
186,rexnet_200_MagFace_SGD_rank_by_id_val,19,0.947944,0.021084
311,rexnet_200_CosFace_SGD_rank_by_id_val,39,0.954885,0.019251
339,gluon_xception65_MagFace_AdamW_rank_by_id_val,39,0.667496,0.016763
517,rexnet_200_CosFace_SGD_rank_by_id_val,59,0.954623,0.016632
723,rexnet_200_CosFace_SGD_rank_by_id_val,79,0.955474,0.023311
751,gluon_xception65_MagFace_AdamW_rank_by_id_val,79,0.657805,0.008905
929,rexnet_200_CosFace_SGD_rank_by_id_val,99,0.954034,0.023311
957,gluon_xception65_MagFace_AdamW_rank_by_id_val,99,0.658133,0.012441
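
To quantify the persistence of architectures, one option is to count at how many of the five checkpoints each experiment sits on the Pareto front. The sketch below re-enters the `index` and `epoch` columns of the table above by hand (with the `_rank_by_id_val` suffix dropped for brevity); the names `front` and `persistence` are hypothetical.

```python
import pandas as pd

# (experiment, epoch) pairs copied from the per-epoch Pareto table
front = pd.DataFrame(
    {
        "index": [
            "rexnet_200_MagFace_SGD",
            "rexnet_200_CosFace_SGD",
            "gluon_xception65_MagFace_AdamW",
            "rexnet_200_CosFace_SGD",
            "rexnet_200_CosFace_SGD",
            "gluon_xception65_MagFace_AdamW",
            "rexnet_200_CosFace_SGD",
            "gluon_xception65_MagFace_AdamW",
        ],
        "epoch": [19, 39, 39, 59, 79, 79, 99, 99],
    }
)

# Number of distinct checkpoints at which each experiment is on the front
persistence = (
    front.groupby("index")["epoch"].nunique().sort_values(ascending=False)
)
print(persistence)
```

On this table, rexnet_200_CosFace_SGD is on the front at four of the five checkpoints, gluon_xception65_MagFace_AdamW at three, and rexnet_200_MagFace_SGD only at the first.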


# Random models

In [18]:
random_rank = [x for x in glob.glob('random/**/*ank_by_id_val.csv') if any(m in x for m in final_models)]
rnd_epoch_columns = ['epoch_0']

In [19]:
meta = pd.read_csv('../timm_model_metadata.csv')

In [20]:
# Use rnd_epoch_columns (defined above) so the trained-model epoch_columns is not overwritten
acc_random_df, acc_disp_random_df, rank_random_df = analyze_rank_files(random_rank, metadata, epochs=rnd_epoch_columns)
_, acc_disp_ratio_random_df, rank_ratio_random_df = analyze_rank_files(random_rank, metadata, ratio=True, epochs=rnd_epoch_columns)
err_random_df, error_ratio_random_df, _ = analyze_rank_files(random_rank, metadata, ratio=True, error=True, epochs=rnd_epoch_columns)

acc_disp_random_df = merge(acc_random_df, acc_disp_random_df)
rank_random_df = merge(acc_random_df, rank_random_df)
acc_disp_ratio_random_df = merge(acc_random_df, acc_disp_ratio_random_df)
rank_ratio_random_df = merge(acc_random_df, rank_ratio_random_df)
error_ratio_random_df = merge(err_random_df, error_ratio_random_df).rename(columns={'Accuracy':'Error'})

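## We first ask about the predictive performance of the random accuracy and architecture features on the disparity of the random model

There isn't a consistent message on how the accuracy of a random model predicts the disparity of that model: the coefficient is positive and significant for statistical parity, negative and significant for difference of ranks, and not significant for either ratio metric.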
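
One quick way to read the regression table in this section is to check, per metric, whether the reported interval for the Accuracy coefficient excludes zero and, if so, with which sign. The sketch below copies the four intervals from that table by hand. It assumes they are 95% confidence intervals, which is not stated in the notebook, and the helper name `sign_if_excludes_zero` is hypothetical.

```python
# (lower, upper) bounds of the Accuracy coefficient, copied from the table
accuracy_intervals = {
    "Statistical Parity": (0.089, 0.232),
    "Ratio of Accuracies": (-1.476, 0.435),
    "Difference of Ranks": (-82.234, -1.752),
    "Ratio of Ranks": (-0.081, 0.371),
}


def sign_if_excludes_zero(lower, upper):
    """Return '+' or '-' when the interval excludes zero, else None."""
    if lower > 0:
        return "+"
    if upper < 0:
        return "-"
    return None


signs = {
    name: sign_if_excludes_zero(lower, upper)
    for name, (lower, upper) in accuracy_intervals.items()
}
for name, sign in signs.items():
    print(f"{name}: {sign}")
```

Only two of the four intervals exclude zero, and they have opposite signs, which is the inconsistency described above.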

In [21]:
def regression_with_random(df, col='Accuracy'):
    # Keep rows with a positive metric; copy so the caller's frame is not modified
    df = df[df[col] > 0].copy()
    # Recover the model name from the file stem and attach its metadata
    df['model'] = df['index'].apply(lambda x: get_name_details(x.replace('_rank_by_id_val',''))[1])
    df = df.merge(meta, left_on='model', right_on='model_name')
    df.fillna('0', inplace=True)

    y = df['Disparity'].astype(float)  # dependent variable
    # Independent variables: the random-model metric plus every metadata
    # column that takes more than one value across the remaining rows
    X = df[[col,
            'feature_dim', 'n_feature_params',
           ]+[x for x in list(meta.columns[8:13])+list(meta.columns[16:])
              if len(df[x].unique())>1]].astype(float)
    X = pd.get_dummies(data=X, drop_first=True)

    X = sm.add_constant(X)  # add the intercept
    lm = sm.OLS(y, X).fit()  # fit the model
    return lm

lm1 = regression_with_random(acc_disp_random_df)
lm2 = regression_with_random(acc_disp_ratio_random_df)
lm3 = regression_with_random(rank_random_df)
lm4 = regression_with_random(rank_ratio_random_df)
lm5 = regression_with_random(error_ratio_random_df, col='Error')

tbl = Stargazer([lm1,lm2,lm3,lm4,lm5])
tbl.show_confidence_intervals(True)
tbl.custom_columns(['Statistical Parity', 'Ratio of Accuracies', 
                    'Difference of Ranks', 'Ratio of Ranks', 'Ratio of Errors'], 
                   [1,1,1,1,1])
tbl


Dependent variable: Disparity
,Statistical Parity,Ratio of Accuracies,Difference of Ranks,Ratio of Ranks,Ratio of Errors
,(1),(2),(3),(4),(5)
Accuracy,0.161***,-0.520,-41.993**,0.145,
,"(0.089, 0.232)","(-1.476, 0.435)","(-82.234, -1.752)","(-0.081, 0.371)",
AdaptiveAvgPool1d,-0.000***,-0.000***,0.029***,0.000***,-0.000***
,"(-0.000, -0.000)","(-0.001, -0.000)","(0.015, 0.043)","(0.000, 0.000)","(-0.000, -0.000)"


## We next ask about the predictive performance of the random accuracy and architecture features on the trained disparity

We see evidence that as the accuracy of a random model increases, the disparity of the trained model gets worse (higher), even when controlling for the architecture features. The exception is ratio of ranks, where the coefficient is negative. Likewise, as errors increase on the random model, the ratio of errors also increases.

In [22]:
def regression_trained_random(df_trained, df_random, col='Accuracy'):
    # Copy both inputs so the caller's frames are not modified
    df_trained = df_trained[df_trained[col] > 0].copy()
    df_random = df_random.copy()
    df_random['model'] = df_random['index'].apply(lambda x: get_name_details(x.replace('_rank_by_id_val',''))[1])
    # Average the random-model metrics per model, then attach the metadata
    df_random = df_random.groupby('model').mean(numeric_only=True).merge(meta, left_on='model', right_on='model_name')
    df_trained['model_name'] = df_trained['index'].apply(lambda x: get_name_details(x.replace('_rank_by_id_val',''))[1])
    df = df_trained.merge(df_random, on='model_name', suffixes=['_Trained','_Random'])
    df.fillna('0', inplace=True)

    y = df['Disparity_Trained'].astype(float)  # dependent variable
    # Independent variables: the random-model metric plus every metadata
    # column that takes more than one value across the merged rows
    X = df[[col+'_Random',
            'feature_dim', 'n_feature_params',
           ]+[x for x in list(meta.columns[8:13])+list(meta.columns[16:])
          if len(df[x].unique())>1]].astype(float)
    X = pd.get_dummies(data=X, drop_first=True)

    X = sm.add_constant(X)  # add the intercept
    lm = sm.OLS(y, X).fit()  # fit the model
    return lm

lm1 = regression_trained_random(acc_disp_df, acc_disp_random_df)
lm2 = regression_trained_random(acc_disp_ratio_df, acc_disp_ratio_random_df)
lm3 = regression_trained_random(rank_df, rank_random_df)
lm4 = regression_trained_random(rank_ratio_df, rank_ratio_random_df)
lm5 = regression_trained_random(error_ratio_df, error_ratio_random_df, col='Error')

tbl = Stargazer([lm1,lm2,lm3,lm4,lm5])
tbl.show_confidence_intervals(True)
tbl.custom_columns(['Statistical Parity', 'Ratio of Accuracies', 
                    'Difference of Ranks', 'Ratio of Ranks', 'Ratio of Errors'], 
                   [1,1,1,1,1])
tbl


Dependent variable: Disparity_Trained
,Statistical Parity,Ratio of Accuracies,Difference of Ranks,Ratio of Ranks,Ratio of Errors
,(1),(2),(3),(4),(5)
Accuracy_Random,0.798**,3.738***,132.464***,-2.272*,
,"(0.039, 1.558)","(1.340, 6.136)","(60.249, 204.679)","(-4.827, 0.284)",
AdaptiveAvgPool1d,0.000***,0.001***,0.028***,0.000,-0.003**
,"(0.000, 0.001)","(0.001, 0.001)","(0.014, 0.041)","(-0.000, 0.000)","(-0.005, -0.000)"
