# Benchmark - WEFE, Fair Embedding Engine and Responsibly.AI

To the best of our knowledge, only three Python libraries implement bias measurement and mitigation methods for word embeddings: WEFE, Fair Embedding Engine (FEE) and Responsibly.

According to their authors, Fair Embedding Engine is defined as "A Library for Analyzing and Mitigating Gender Bias in Word Embeddings", while Responsibly is defined as "Toolkit for Auditing and Mitigating Bias and Fairness of Machine Learning Systems."

The FEE and Responsibly documentation can be found at the following links respectively: 
- https://github.com/FEE-Fair-Embedding-Engine/FEE
- https://docs.responsibly.ai/

This document compares these libraries with WEFE in several areas.

The points to be evaluated are:
    
1. Ease of installation
2. Quality of the package and documentation
3. Ease of loading models
4. Ease of running bias measurements
5. Performance in execution times


## 0. Metrics Comparison

## 1. Ease of installation

This comparison aims to evaluate how easy it is to install the library.

### WEFE

According to the documentation, WEFE is available for installation using the Python Package Index (via pip) as well as via conda.


```bash
pip install --upgrade wefe
# or
conda install -c pbadilla wefe
```

### Fair Embedding Engine

In the case of FEE, neither the documentation nor the repository indicates how to install the package. Therefore, the easiest thing to do in this case is to clone the repository and then install the requirements manually.

1. Clone the repo
```bash
$ git clone https://github.com/FEE-Fair-Embedding-Engine/FEE
```

2. Install the requirements.
```bash
$ pip install -r FEE/requirements.txt
$ pip install sympy
$ pip install -U gensim==3.8.3
```

### Responsibly

According to its documentation, Responsibly is also hosted on the Python Package Index, so it can be installed using pip.

```bash
$ pip install responsibly
```

### Conclusion

Both WEFE and Responsibly are easy to install, which lowers the initial barrier to entry. FEE, on the other hand, requires more knowledge of Python before it can be used.
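
After installation, it can help to confirm which of the packages are actually available in the current environment. The following is a minimal sketch using only the standard library; the distribution names `wefe`, `responsibly` and `gensim` are the ones used in the commands above, and `installed_versions` is a hypothetical helper name. FEE is not listed because it is used from a cloned repository rather than installed as a distribution.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["wefe", "responsibly", "gensim"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```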

## 2. Quality of the package and documentation

This benchmark seeks to compare the quality of the documentation as well as the quality and best practices of the code.

### WEFE

WEFE has a complete documentation page that explains the use of the package in detail: an About section with the motivation and objectives of the project, a quick start showing how to install the library, multiple user guides on measuring and mitigating bias, a detailed API reference for the implemented methods, theoretical guides and, finally, replications of previous case studies.

In addition, most of the code is tested and was developed with continuous integration mechanisms (a linter and tests run in GitHub Actions) that help maintain good code quality.


### Fair Embedding Engine

FEE has documentation that covers only the basic aspects of the API, plus a flowchart showing the main concepts of the library.
The documentation does not contain user guides, code examples or theoretical information about the implemented methods.

Furthermore, we could not identify any tests, linters or continuous integration mechanisms.

### Responsibly

Responsibly also has a complete documentation page that explains the use of the package: an index with the main project information, a quick start showing how to install the library, demos that act as user guides, and a detailed API reference for the implemented methods.

In addition, most of the code is tested and was developed with continuous integration mechanisms (a linter and tests run in GitHub Actions) that help maintain good code quality.


### Conclusion

In terms of documentation, WEFE is much more detailed than the other libraries, with more extensive guides and replications of previous case studies.
Responsibly has sufficient documentation to run its main functionality without major problems, but it is not exhaustive.
FEE, on the other hand, only has API documentation, which is insufficient for new users to start using it directly.

With respect to software quality, both WEFE and Responsibly comply with best practices.
FEE has neither tests nor mechanisms to control code quality.

## 3. Ease of loading models

This comparison looks at how easy it is to load a Word Embedding model.
In this benchmark, two cases are compared: loading a model from the gensim API (`glove-twitter-25`) and loading a model from a binary file (`word2vec`).

To load a model from a file, the original word2vec model must first be downloaded and decompressed, which can be done with the following code:

In [3]:
!wget https://github.com/RaRe-Technologies/gensim-data/releases/download/word2vec-google-news-300/word2vec-google-news-300.gz
!gzip -dv word2vec-google-news-300.gz

### WEFE

In WEFE, models are simply wrappers of Gensim models. This means that reading the model (whether from the API or from a file) is handled by the Gensim loaders, while the class that provides access to the embeddings is managed by WEFE.

The following code loads the GloVe model from the gensim API:

In [None]:
from wefe.word_embedding_model import WordEmbeddingModel
import gensim.downloader as api

# load glove
twitter_25 = api.load('glove-twitter-25')
model = WordEmbeddingModel(twitter_25, 'glove twitter dim=25')

The following code allows you to load word2vec from its original file.

In [1]:
from wefe.word_embedding_model import WordEmbeddingModel
from gensim.models.keyedvectors import KeyedVectors

# load word2vec
word2vec = KeyedVectors.load_word2vec_format('word2vec-google-news-300', binary=True)
model = WordEmbeddingModel(word2vec, 'word2vec-google-news-300')
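
To illustrate what "wrapper" means here, the sketch below shows the general pattern with a toy class. It is not WEFE's implementation: `ToyEmbeddingWrapper` and its methods are hypothetical names, and a plain dictionary stands in for the Gensim object that would normally be loaded first.

```python
class ToyEmbeddingWrapper:
    """Pairs an already-loaded vector store with a readable name."""

    def __init__(self, vectors, name):
        # `vectors` is anything that supports `word in vectors` and
        # `vectors[word]`; loading it is the caller's responsibility.
        self.vectors = vectors
        self.name = name

    def __contains__(self, word):
        return word in self.vectors

    def get_embeddings(self, words):
        """Return (found, missing): vectors of known words and the unknown words."""
        found = {w: self.vectors[w] for w in words if w in self.vectors}
        missing = [w for w in words if w not in self.vectors]
        return found, missing


# Toy data standing in for a loaded embedding model.
toy_vectors = {"she": [0.1, 0.9], "he": [0.8, 0.2]}
toy_model = ToyEmbeddingWrapper(toy_vectors, "toy embeddings dim=2")
found, missing = toy_model.get_embeddings(["she", "he", "they"])
print(sorted(found), missing)
```

Keeping loading outside the wrapper is what lets the same class work regardless of how the vectors were obtained.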

### FEE

FEE also offers direct support for loading models from the gensim API through the following code.
In this case, the model loading is coupled to the class which then has the methods to access the embeddings.

In [4]:
from FEE.fee.embedding.loader import WE

fee_model = WE().load(ename = 'glove-twitter-25')

In [2]:
from FEE.fee.embedding.loader import WE

fee_model = WE().load(fname = 'word2vec-google-news-300', format='bin')

### Responsibly

Responsibly has no intermediate interface for handling embedding models; it simply uses the gensim interface for this purpose, as the following script shows.

In [1]:
import gensim.downloader as api
from gensim.models.keyedvectors import KeyedVectors

In [13]:
twitter_25 = api.load('glove-twitter-25')

word2vec = KeyedVectors.load_word2vec_format('word2vec-google-news-300', binary=True)

### Conclusion

In this case, the three libraries show similar behaviors and capabilities, so no significant differences can be distinguished between them.

## 4. Ease of running bias measurements

This benchmark shows how easy it is to run bias measurement queries with the available metrics.
To keep the comparison simple, the word sets and the embedding model are kept fixed; only the metric varies.

In [2]:
# words to evaluate

female_terms = ["female", "woman", "girl", "sister", "she", "her", "hers", "daughter"]
male_terms = ["male", "man", "boy", "brother", "he", "him", "his", "son"]

family_terms = [
    "home",
    "parents",
    "children",
    "family",
    "cousins",
    "marriage",
    "wedding",
    "relatives",
]
career_terms = [
    "executive",
    "management",
    "professional",
    "corporation",
    "salary",
    "office",
    "business",
    "career",
]

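# optional, only for wefe use.
# note: these names are only labels shown in the results; the attribute sets
# passed to the query below are family_terms and career_terms.
target_sets_names = ["Female Terms", "Male Terms"]
attribute_sets_names = ["Arts", "Science"]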


### WEFE

WEFE defines a standardized framework for executing metrics: in short, a query is defined that acts as a container for the words to be tested, and then the query and the model are passed together as input to a metric.

The outputs of the metrics are always dictionaries, since most metrics return additional information that may be useful.

In [3]:
# import the modules
from wefe.query import Query

# 1. create the query
query = Query(
    [female_terms, male_terms],
    [family_terms, career_terms],
    target_sets_names,
    attribute_sets_names,
)
query

<Query: Female Terms and Male Terms wrt Arts and Science
- Target sets: [['female', 'woman', 'girl', 'sister', 'she', 'her', 'hers', 'daughter'], ['male', 'man', 'boy', 'brother', 'he', 'him', 'his', 'son']]
- Attribute sets:[['home', 'parents', 'children', 'family', 'cousins', 'marriage', 'wedding', 'relatives'], ['executive', 'management', 'professional', 'corporation', 'salary', 'office', 'business', 'career']]>

In [None]:
from wefe.metrics.WEAT import WEAT

# 2. instantiate a WEAT metric and pass it the query plus the model.
weat = WEAT()
result = weat.run_query(query, model)
result

{'query_name': 'Female Terms and Male Terms wrt Arts and Science',
 'result': 0.46343880688073114,
 'weat': 0.46343880688073114,
 'effect_size': 0.45076526581093423,
 'p_value': nan}
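
For readers unfamiliar with what the `weat` and `effect_size` values represent, the following is a minimal pure-Python sketch of the WEAT statistic as it is commonly defined (Caliskan et al.). It is not WEFE's implementation; the toy two-dimensional vectors and the function names are assumptions made for illustration, and using the sample standard deviation in the effect size is one common convention.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """s(w, A, B): mean similarity to A minus mean similarity to B."""
    sim_a = mean(cosine(w, a) for a in attrs_a)
    sim_b = mean(cosine(w, b) for b in attrs_b)
    return sim_a - sim_b


def weat(targets_x, targets_y, attrs_a, attrs_b):
    """Return (score, effect size) for two target and two attribute sets."""
    s_x = [association(x, attrs_a, attrs_b) for x in targets_x]
    s_y = [association(y, attrs_a, attrs_b) for y in targets_y]
    score = sum(s_x) - sum(s_y)
    effect_size = (mean(s_x) - mean(s_y)) / stdev(s_x + s_y)
    return score, effect_size


# Toy vectors (assumed for illustration, not taken from any real model).
targets_x = [[1.0, 0.0], [0.9, 0.1]]
targets_y = [[0.0, 1.0], [0.1, 0.9]]
attrs_a = [[1.0, 0.0]]
attrs_b = [[0.0, 1.0]]

score, effect_size = weat(targets_x, targets_y, attrs_a, attrs_b)
print(score, effect_size)
```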

Since `run_query` is independent of the query and the model, it can take several parameters that customize how the metric is computed. In this case, we show how to normalize the words before looking them up in the model by lowercasing them and then removing their accents.

In [None]:
weat = WEAT()
result = weat.run_query(
    query,
    model,
    preprocessors=[{"lowercase": True, "strip_accents": True}],
)
result


{'query_name': 'Female Terms and Male Terms wrt Arts and Science',
 'result': 0.46343880688073114,
 'weat': 0.46343880688073114,
 'effect_size': 0.45076526581093423,
 'p_value': nan}

In this case, we show how to calculate the p-value through a permutation test.

In [None]:
weat = WEAT()
result = weat.run_query(
    query,
    model,
    calculate_p_value=True,
)
result


{'query_name': 'Female Terms and Male Terms wrt Arts and Science',
 'result': 0.46343880688073114,
 'weat': 0.46343880688073114,
 'effect_size': 0.45076526581093423,
 'p_value': 0.1837816218378162}
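
As a rough illustration of the idea behind a permutation test (not WEFE's implementation), the sketch below takes precomputed per-word association values for the two target sets, repeatedly reshuffles them into two groups of the original sizes, and reports the fraction of shuffles whose statistic exceeds the observed one. The function name, the number of iterations, the seed and the toy values are all assumptions.

```python
import random


def permutation_p_value(assoc_x, assoc_y, iterations=2000, seed=0):
    """One-sided p-value: share of random splits with a larger statistic."""
    rng = random.Random(seed)
    observed = sum(assoc_x) - sum(assoc_y)
    pooled = list(assoc_x) + list(assoc_y)
    n_x = len(assoc_x)
    larger = 0
    for _ in range(iterations):
        rng.shuffle(pooled)
        stat = sum(pooled[:n_x]) - sum(pooled[n_x:])
        if stat > observed:
            larger += 1
    return larger / iterations


# Toy association values (assumed); chosen to be exactly representable floats.
strong = permutation_p_value([1.0, 0.75, 0.5, 0.25], [-0.25, -0.5, -0.75, -1.0])
mixed = permutation_p_value([0.5, -0.25, 0.25, -0.5], [0.25, -0.5, 0.5, -0.25])
print(strong, mixed)
```

A clearly separated pair of target sets yields a small p-value, while interleaved values yield a larger one. Because the test is randomized, repeated runs with different seeds give slightly different p-values, which is consistent with the small differences between the p-values reported in this document.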

This interface makes it possible for us to switch very easily to similar metrics (i.e. supporting the same number of word sets). 

In [None]:
from wefe.metrics import RNSB

rnsb = RNSB()
result = rnsb.run_query(query, model)
result

{'query_name': 'Female Terms and Male Terms wrt Arts and Science',
 'result': 0.09665597783037311,
 'rnsb': 0.09665597783037311,
 'negative_sentiment_probabilities': {'female': 0.563239371795448,
  'woman': 0.30234464965107843,
  'girl': 0.19143334905865195,
  'sister': 0.15438537572028865,
  'she': 0.38229504132390235,
  'her': 0.37854574816662556,
  'hers': 0.30166728366749673,
  'daughter': 0.13469535205337924,
  'male': 0.4713249668158237,
  'man': 0.4032154496142717,
  'boy': 0.19473458117802422,
  'brother': 0.17832032007221466,
  'he': 0.510376309937578,
  'him': 0.4625630377446609,
  'his': 0.5307538027998574,
  'son': 0.17367200729993326},
 'negative_sentiment_distribution': {'female': 0.1056027624821933,
  'woman': 0.05668714195722143,
  'girl': 0.035892182798530355,
  'sister': 0.028945991667703896,
  'she': 0.07167718463706767,
  'her': 0.07097422292204784,
  'hers': 0.056560141391103914,
  'daughter': 0.025254273729135232,
  'male': 0.08836956543701145,
  'man': 0.07559958

In [None]:
from wefe.metrics import MAC

mac = MAC()
result = mac.run_query(query, model)
result

{'query_name': 'Female Terms and Male Terms wrt Arts and Science',
 'result': 0.8416415235615204,
 'mac': 0.8416415235615204,
 'targets_eval': {'Female Terms': {'female': {'Arts': 0.9185737599618733,
    'Science': 0.916069650076679},
   'woman': {'Arts': 0.752434104681015, 'Science': 0.9377805145923048},
   'girl': {'Arts': 0.707457959651947, 'Science': 0.9867974997032434},
   'sister': {'Arts': 0.5973392464220524, 'Science': 0.9482253392925486},
   'she': {'Arts': 0.7872791914269328, 'Science': 0.9161583095556125},
   'her': {'Arts': 0.7883057091385126, 'Science': 0.9237247597193345},
   'hers': {'Arts': 0.7385367527604103, 'Science': 0.9480051446007565},
   'daughter': {'Arts': 0.5472579970955849, 'Science': 0.9277344475267455}},
  'Male Terms': {'male': {'Arts': 0.8735092766582966,
    'Science': 0.9468009045813233},
   'man': {'Arts': 0.8249392118304968, 'Science': 0.9350165261421353},
   'boy': {'Arts': 0.7106057899072766, 'Science': 0.9879048476286698},
   'brother': {'Arts': 0.

### Fair Embedding Engine

In the case of Fair Embedding Engine, the embedding model is passed when the metric is instantiated, and the value is then calculated through the `compute` method.

In this respect, FEE differs somewhat from WEFE's standardization by making each metric instance model-dependent.

On the other hand, the word sets are passed directly as variadic positional arguments (`*args`), which makes it difficult to understand how many and which word sets to pass.

In [None]:
from FEE.fee.metrics import WEAT as FEE_WEAT
fee_weat = FEE_WEAT(fee_model)

fee_weat.compute(female_terms, male_terms, family_terms, career_terms)

FEE's implementation of WEAT also allows the p-value to be calculated.

In [None]:
fee_weat.compute(female_terms, male_terms, family_terms, career_terms, p_val=True)

However, it does not support more complex actions such as preprocessing the word sets.

Finally, we were not able to find any other metric that was easily replaceable using the same interface (unlike with WEFE and its standardization layer).
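
The difference between the two interfaces can be sketched as follows. This is a hypothetical illustration, not code from either library: `ToyQuery`, `ToyMetric` and the `template` attribute are made-up names that mimic the idea of a query container whose shape is checked by the metric, in contrast with passing word sets as bare positional arguments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyQuery:
    """Container that groups the word sets of one bias test."""
    target_sets: List[List[str]]
    attribute_sets: List[List[str]]

    @property
    def template(self):
        # Shape of the query: (number of target sets, number of attribute sets).
        return (len(self.target_sets), len(self.attribute_sets))


class ToyMetric:
    """Base class: each metric declares the query shape it supports."""
    template = (2, 2)

    def run_query(self, query, model):
        if query.template != self.template:
            raise ValueError(
                f"expected a query with template {self.template}, "
                f"got {query.template}"
            )
        return self._compute(query, model)

    def _compute(self, query, model):
        raise NotImplementedError


class CountKnownWords(ToyMetric):
    """Placeholder metric: counts the query words present in the model."""

    def _compute(self, query, model):
        words = [w for s in query.target_sets + query.attribute_sets for w in s]
        return {"result": sum(w in model for w in words)}


toy_model = {"she": [0.1], "he": [0.2], "home": [0.3], "career": [0.4]}
query = ToyQuery([["she"], ["he"]], [["home"], ["career", "salary"]])
print(CountKnownWords().run_query(query, toy_model))
```

Because every metric receives the same container, swapping one metric for another with the same template only changes the class being instantiated.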

### Responsibly

## 5. Ease of running bias mitigation algorithms

This benchmark shows how easy it is to execute bias mitigation algorithms.
To keep the comparison simple, the word sets and the embedding model are kept fixed; only the algorithm varies.

In [5]:
!pip install wefe



In [5]:
from wefe.datasets import fetch_debiaswe
from wefe.debias.hard_debias import HardDebias
from wefe.utils import load_test_model

In [6]:
# word sets to be used
debiaswe_wordsets = fetch_debiaswe()

definitional_pairs = debiaswe_wordsets["definitional_pairs"]

gender_specific = debiaswe_wordsets["gender_specific"]

targets = [
    "executive",
    "management",
    "professional",
    "corporation",
    "salary",
    "office",
    "business",
    "career",
    "home",
    "parents",
    "children",
    "family",
    "cousins",
    "marriage",
    "wedding",
    "relatives",
]

### 1. WEFE

WEFE defines a standardized framework for executing bias mitigation algorithms based on scikit-learn's fit-transform interface.

WEFE lets the user choose the word sets used in the debiasing process, so the algorithms can be applied to any type of bias depending on the words provided.
The `fit` method receives the parameters the algorithm needs to function, such as the definitional pairs. The `transform` method can receive two word sets: `target` and `ignore`. If `target` is provided, the algorithm is applied only to those words. If it is not provided, the algorithm is applied to all of the words in the embedding, except those given in the `ignore` parameter (if provided).
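
The fit/transform contract and the `target`/`ignore` semantics can be sketched with a toy class. This is a simplified illustration, not WEFE's `HardDebias`: the class name `ToyNeutralizer` is hypothetical, the model is a plain dictionary, the bias direction is taken as the summed difference of the definitional pairs (the published algorithm uses PCA), only the neutralize step is shown, and the copy flag is named `copy_model` here to avoid shadowing the standard `copy` module.

```python
import copy
import math


class ToyNeutralizer:
    def fit(self, model, definitional_pairs):
        """Learn a unit bias direction from word pairs such as (she, he)."""
        dim = len(next(iter(model.values())))
        direction = [0.0] * dim
        for first, second in definitional_pairs:
            for i in range(dim):
                direction[i] += model[first][i] - model[second][i]
        norm = math.sqrt(sum(d * d for d in direction))
        self.direction_ = [d / norm for d in direction]
        return self

    def transform(self, model, target=None, ignore=None, copy_model=True):
        """Neutralize `target` words, or every word not listed in `ignore`."""
        result = copy.deepcopy(model) if copy_model else model
        skipped = set(ignore or [])
        if target is not None:
            words = [w for w in target if w in result]
        else:
            words = [w for w in result if w not in skipped]
        for word in words:
            vec = result[word]
            # Subtract the projection of the vector onto the bias direction.
            proj = sum(v * d for v, d in zip(vec, self.direction_))
            result[word] = [v - proj * d for v, d in zip(vec, self.direction_)]
        return result


# Toy two-dimensional model (assumed values).
toy_model = {
    "she": [1.0, 0.2], "he": [-1.0, 0.2],
    "career": [-0.4, 0.9], "home": [0.5, 0.8],
}
debiaser = ToyNeutralizer().fit(toy_model, definitional_pairs=[("she", "he")])
debiased = debiaser.transform(toy_model, target=["career", "home"])
print(debiased["career"], debiased["home"])
```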

In [5]:
from wefe.debias.hard_debias import HardDebias

# 1. instantiate the Hard Debias algorithm
hd = HardDebias(verbose=False, criterion_name="gender")

In [6]:
# 2. apply the fit method, passing the model and the definitional pairs.
hd.fit(model, definitional_pairs=definitional_pairs)
# 3. apply the transform method, passing the model and the target and ignore
# word sets; the result is the debiased model.
gender_debiased_model = hd.transform(
    model, target=targets, ignore=gender_specific, copy=True
)

Copy argument is True. Transform will attempt to create a copy of the original model. This may fail due to lack of memory.
Model copy created successfully.




The interface makes it possible to use different algorithms in a very similar way.

In [7]:
from wefe.debias.repulsion_attraction_neutralization import (
    RepulsionAttractionNeutralization,
)

ran = RepulsionAttractionNeutralization().fit(
    model=model,
    definitional_pairs=definitional_pairs,
)

debiased_model = ran.transform(
    model=model, target=targets, ignore=gender_specific, copy=True
)

Copy argument is True. Transform will attempt to create a copyof the original model. This may fail due to lack of memory.
Model copy created successfully.




It is possible to compare the bias in the model before and after applying the algorithms.

In [10]:
# bias in the original model
weat = WEAT()
result = weat.run_query(
    query,
    model,
    calculate_p_value=True,
)
result

{'query_name': 'Female Terms and Male Terms wrt Arts and Science',
 'result': 0.46343880688073114,
 'weat': 0.46343880688073114,
 'effect_size': 0.45076526581093423,
 'p_value': 0.18818118188181182}

In [11]:
# bias after applying Hard Debias
weat = WEAT()
result = weat.run_query(
    query,
    gender_debiased_model,
    calculate_p_value=True,
)
result

{'query_name': 'Female Terms and Male Terms wrt Arts and Science',
 'result': 0.006566311581991613,
 'weat': 0.006566311581991613,
 'effect_size': 0.00667966690685079,
 'p_value': 0.49495050494950504}

In [9]:
# bias after applying Repulsion Attraction Neutralization
weat = WEAT()
result = weat.run_query(
    query,
    debiased_model,
    calculate_p_value=True,
)
result


{'query_name': 'Female Terms and Male Terms wrt Arts and Science',
 'result': -0.10798480443190783,
 'weat': -0.10798480443190783,
 'effect_size': -0.11753826253471557,
 'p_value': 0.583941605839416}
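
To make the three results easier to compare, the reported effect sizes can be turned into a relative reduction in absolute bias. The values below are copied from the outputs above; the helper name `relative_reduction` is only illustrative, and using the absolute effect size as the quantity to compare is an assumption of this sketch.

```python
def relative_reduction(before, after):
    """Fraction by which the absolute effect size shrank (1.0 means fully removed)."""
    return 1.0 - abs(after) / abs(before)


# WEAT effect sizes copied from the three outputs above.
effect_sizes = {
    "original": 0.45076526581093423,
    "Hard Debias": 0.00667966690685079,
    "RAN": -0.11753826253471557,
}

baseline = effect_sizes["original"]
for name in ("Hard Debias", "RAN"):
    reduction = relative_reduction(baseline, effect_sizes[name])
    print(f"{name}: {reduction:.1%} reduction in |effect size|")
```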

### 2. Fair Embedding Engine


In the case of Fair Embedding Engine, the embedding model is passed when the algorithm is instantiated. FEE does not allow the user to provide definitional pairs, so it only works on gender bias, since the word sets it uses are specific to this type of bias.

Applying the debiasing is as simple as calling the algorithm's `run` method with the list of words to which the debiasing process will be applied.

In [6]:
from FEE.fee.debias import HardDebias, RANDebias

In [7]:
fee_model.normalize()  # the model must be normalized

In [13]:
import copy
HD = copy.deepcopy(fee_model)
# instantiate the algorithm and apply it to the embedding model
HD = HardDebias(HD).run(word_list=targets)

FEE allows different algorithms to be used in a very similar way.

In [9]:
RAN = copy.deepcopy(fee_model)
RAN = RANDebias(RAN).run(words=targets)



Comparing bias in the models.

In [12]:
# bias in the original (normalized) model
fee_weat = FEE_WEAT(fee_model)

fee_weat.compute(female_terms, male_terms, family_terms, career_terms)

0.45076525

In [14]:
# bias after applying Hard Debias
fee_weat = FEE_WEAT(HD)

fee_weat.compute(female_terms, male_terms, family_terms, career_terms)

0.056027465

In [11]:
from FEE.fee.metrics import WEAT as FEE_WEAT
# bias after applying RANDebias
fee_weat = FEE_WEAT(RAN)

fee_weat.compute(female_terms, male_terms, family_terms, career_terms)

-0.039264895

### 3. Responsibly


In the case of Responsibly, the embedding model is passed when the `GenderBiasWE` class is instantiated. Responsibly does not allow the user to provide definitional pairs; the bias to mitigate is fixed to gender bias.

Performing the debiasing is as simple as calling the `debias` method.

In [14]:
from responsibly.we.weat import calc_single_weat

# Bias in the model before applying the algorithm
calc_single_weat(
    word2vec,
    first_target={"name": "female_terms", "words": female_terms},
    second_target={"name": "male_terms", "words": male_terms},
    first_attribute={"name": "family_terms", "words": family_terms},
    second_attribute={"name": "career_terms", "words": career_terms},
)

{'Target words': 'female_terms vs. male_terms',
 'Attrib. words': 'family_terms vs. career_terms',
 's': 0.46343886107206345,
 'd': 0.45076525,
 'p': 0.19704739704739704,
 'Nt': '8x2',
 'Na': '8x2'}

In [9]:
from responsibly.we import GenderBiasWE

In [10]:
gender_bias_we = GenderBiasWE(word2vec) # instance the GenderBiasWE
gender_bias_we.debias(neutral_words=targets) # apply the debias

The debiasing applied corresponds to Hard Debias, which is the only algorithm implemented in the library.

In [12]:
# Bias in the model after applying the algorithm
calc_single_weat(
    word2vec,
    first_target={"name": "female_terms", "words": female_terms},
    second_target={"name": "male_terms", "words": male_terms},
    first_attribute={"name": "family_terms", "words": family_terms},
    second_attribute={"name": "career_terms", "words": career_terms},
)

{'Target words': 'female_terms vs. male_terms',
 'Attrib. words': 'family_terms vs. career_terms',
 's': 0.047348424792289734,
 'd': 0.04824888,
 'p': 0.4641025641025641,
 'Nt': '8x2',
 'Na': '8x2'}

One problem found in Responsibly is that attempting to debias an embedding that lacks any of the words the library uses as definitional pairs raises an error.




In [8]:
gender_bias_we = GenderBiasWE(twitter_25) 
gender_bias_we.debias() 

KeyError: ignored
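
A defensive check before debiasing can surface this problem earlier. The sketch below is a generic, hypothetical helper (`missing_words` is not part of Responsibly) that reports which required words are absent from a vocabulary; the example word list and the toy vocabulary are made up, and the actual words a library requires would have to be taken from its own data files.

```python
def missing_words(required_words, vocabulary):
    """Return the required words that the vocabulary does not contain, in order."""
    return [word for word in required_words if word not in vocabulary]


# Toy stand-ins (assumed): a tiny vocabulary and a list of required words.
toy_vocabulary = {"she", "he", "woman", "man", "girl", "boy"}
toy_required = ["she", "he", "herself", "himself", "woman", "man"]

absent = missing_words(toy_required, toy_vocabulary)
if absent:
    print(f"Cannot debias safely, missing words: {absent}")
else:
    print("All required words are present.")
```

Any object that supports the `in` operator can be passed as the vocabulary, which should include a gensim `KeyedVectors` instance.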

### Conclusion

All three libraries provide a simple way of applying bias mitigation algorithms, and all of them mitigate bias in the word embedding model by similar amounts according to the metric used. The major difference among them is that WEFE gives users more control by allowing them to choose the bias criterion to mitigate, while FEE and Responsibly only work on gender bias. In addition, WEFE includes more algorithms than the other two frameworks.

| Algorithm               | WEFE | FEE | Responsibly | EmbeddingBiasScores |
|-------------------------|------|-----|-------------|---------------------|
| Hard Debias             | ✔    | ✔   | ✔           | ✖                   |
| Double Hard Debias      | ✔    | ✖   | ✖           | ✖                   |
| Half Sibling Regression | ✔    | ✔   | ✖           | ✖                   |
| RAN                     | ✔    | ✔   | ✖           | ✖                   |
| Multiclass HD           | ✔    | ✖   | ✖           | ✖                   |


| Feature                                      | WEFE                                  | FEE                                                                | Responsibly                              | EmbeddingBiasScores             |
|----------------------------------------------|---------------------------------------|--------------------------------------------------------------------|------------------------------------------|---------------------------------|
| Implemented metrics                          | 7                                     | 7                                                                  | 3                                        | 6                               |
| Implemented debias algorithms                | 5                                     | 3                                                                  | 1                                        | 0                               |
| Extensible                                   | Easy                                  | Easy                                                               | Difficult, not very modular              | Easy                            |
| Well-defined interface for metrics           | ✔                                     | ✖                                                                  | ✖                                        | ✔                               |
| Well-defined interface for debias algorithms | ✔                                     | ✖                                                                  | ✖                                        | ✖                               |
| Maintenance status                           | Updated                               | Outdated                                                           | Outdated                                 | Updated                         |
| Installation                                 | Easy: pip or conda                    | There are no instructions. It can be installed from the repository | Only with pip. Presents problems         | Only from the repository        |
| Documentation                                | Extensive documentation with examples | Almost no documentation                                            | Limited documentation with some examples | No documentation, only examples |