# Benchmark - WEFE, Fair Embedding Engine and Responsibly.AI

To the best of our knowledge, only three other Python libraries implement bias measurement or mitigation methods: Fair Embedding Engine (FEE), Responsibly and EmbeddingBiasScores.

According to its authors, Fair Embedding Engine is defined as "A Library for Analyzing and Mitigating Gender Bias in Word Embeddings", while Responsibly is defined as "Toolkit for Auditing and Mitigating Bias and Fairness of Machine Learning Systems."

The FEE and Responsibly documentation can be found at the following links respectively: 
- https://github.com/FEE-Fair-Embedding-Engine/FEE
- https://docs.responsibly.ai/

This document compares these libraries with WEFE across several areas.

The points to be evaluated are:

1. Ease of installation.
2. Quality of the package and documentation.
3. Ease of loading models.
4. Ease of running bias measurements.
5. Performance in execution times.


## 0. Metrics Comparison

## 1. Ease of installation

This comparison aims to evaluate how easy it is to install the library.

### WEFE

According to the documentation, WEFE is available for installation using the Python Package Index (via pip) as well as via conda.


```bash
pip install --upgrade wefe
# or
conda install -c pbadilla wefe
```

### Fair Embedding Engine

In the case of FEE, neither the documentation nor the repository indicates how to install the package. Therefore, the easiest thing to do in this case is to clone the repository and then install the requirements manually.

1. Clone the repo
```bash
$ git clone https://github.com/FEE-Fair-Embedding-Engine/FEE
```

2. Install the requirements.
```bash
$ pip install -r FEE/requirements.txt
$ pip install sympy
$ pip install -U gensim==3.8.3
```

### Responsibly

According to its documentation, Responsibly is also hosted on the Python Package Index, so it can be installed using pip.

```bash
$ pip install responsibly
```

### EmbeddingBiasScores

In the case of EmbeddingBiasScores, the documentation indicates that the repository can be cloned and then installed locally.

```bash
$ git clone https://github.com/HammerLabML/EmbeddingBiasScores.git
$ pip install -r EmbeddingBiasScores/requirements.txt
```

### Conclusion

Both WEFE and Responsibly are easy to install, which lowers the initial barrier to entry. FEE and EmbeddingBiasScores require more knowledge of Python and pip, since they have to be cloned and their requirements installed manually.

## 2. Source code quality and documentation

This benchmark seeks to compare the quality of the documentation as well as the quality and best practices of the code.

### WEFE

WEFE has a complete documentation page that explains the use of the package in detail: an about page with the motivation and objectives of the project, a quick start showing how to install the library, multiple user manuals on measuring and mitigating bias, a detailed API reference of the implemented methods, theoretical manuals and, finally, implementations of previous case studies.

In addition, most of the code is tested and was developed using continuous integration mechanisms (a linter and automated tests in GitHub Actions) that help ensure good code quality.


### Fair Embedding Engine

FEE has documentation that covers only the basic aspects of the API, plus a flowchart showing the main concepts of the library.
The documentation does not contain user guides, code examples or theoretical information about the implemented methods.

No tests, linters or continuous code integration mechanisms could be identified, which makes the code prone to errors.

### Responsibly

Responsibly also has a complete documentation page that explains the use of the package: an index with the main project information and a quick start showing how to install the library, demos that act as user manuals, and a detailed API reference of the implemented methods.

In addition, most of the code is tested and was developed using continuous integration mechanisms (a linter and testing in GitHub Actions) that help maintain good code quality.


### EmbeddingBiasScores

It was not possible to find formal documentation explaining how to run bias tests in EmbeddingBiasScores.
There is only a small notebook with some use cases, which, at the time of writing this document, had several flaws that made it difficult to understand and use.

No tests, linters or continuous code integration mechanisms could be identified, which makes the code prone to errors.

### Conclusion

In terms of documentation, WEFE provides much more detailed documentation than the other libraries, with more extensive manuals and replications of previous case studies.
Responsibly has sufficient documentation to execute its main functionalities without major problems; however, it is not exhaustive.
FEE only has API documentation, which is insufficient for new users to use it directly.
Lastly, EmbeddingBiasScores only provides a notebook with examples.

With respect to software quality, both WEFE and Responsibly comply with best practices, such as automated testing and linting.
FEE and EmbeddingBiasScores lack testing and code quality control.

## 3. Ease of loading models

This comparison looks at how easy it is to load a word embedding model.
Two tests are compared in this benchmark: loading a model from the gensim API (`glove-twitter-25`) and loading a model from a binary file (`word2vec`).

The second test requires the original word2vec model, which can be downloaded using the following code:

In [None]:
!wget https://github.com/RaRe-Technologies/gensim-data/releases/download/word2vec-google-news-300/word2vec-google-news-300.gz
!gzip -dv word2vec-google-news-300.gz

### WEFE

In WEFE, models are simply wrappers of Gensim models. This implies that the model reading process (either loaded by the API or by a file) is handled by the Gensim loaders, while the class that generates the objects that allow access to the embeddings is managed by WEFE.

The following code can be used for the loading of the glove model from the gensim API:

In [1]:
from wefe.word_embedding_model import WordEmbeddingModel
import gensim.downloader as api

# load glove
twitter_25 = api.load("glove-twitter-25")
model = WordEmbeddingModel(twitter_25, "glove twitter dim=25")

The following code allows you to load word2vec from its original file.

In [14]:
from wefe.word_embedding_model import WordEmbeddingModel
from gensim.models.keyedvectors import KeyedVectors

# load word2vec from the decompressed binary file downloaded above
word2vec = KeyedVectors.load_word2vec_format("word2vec-google-news-300", binary=True)
# alternative: fetch the same model through the gensim API
# import gensim.downloader as api
# word2vec = api.load("word2vec-google-news-300")
model = WordEmbeddingModel(word2vec, "word2vec-google-news-300")

### FEE

FEE also offers direct support for loading models from the gensim API through the following code.
In this case, model loading is coupled to the class that also provides the methods to access the embeddings.

In [1]:
from FEE.fee.embedding.loader import WE

fee_model = WE().load(ename="glove-twitter-25")

In [1]:
from FEE.fee.embedding.loader import WE

fee_model = WE().load(fname="word2vec-google-news-300", format="bin")


### Responsibly and EmbeddingBiasScores

Neither Responsibly nor EmbeddingBiasScores implements an intermediate interface to handle embedding models; they simply use gensim or similar interfaces for this purpose. This is reflected in the following script:

In [2]:
import gensim.downloader as api
from gensim.models.keyedvectors import KeyedVectors

# load twitter_25 model from gensim api
twitter_25 = api.load("glove-twitter-25")

# load word2vec model from file
word2vec = KeyedVectors.load_word2vec_format("word2vec-google-news-300", binary=True)

### Conclusion

Both WEFE and FEE implement interfaces to internally manage the use of embedding models according to their needs.
Responsibly and EmbeddingBiasScores do not implement wrappers, which may make them harder to use later on.

## 4. Ease of running bias measurements

This benchmark is intended to show how easy it is to run bias queries with the available metrics.
To keep the comparison simple, the word sets and the embedding model are kept fixed; only the metric being executed varies.

In [35]:
# words to evaluate

female_terms = ["female", "woman", "girl", "sister", "she", "her", "hers", "daughter"]
male_terms = ["male", "man", "boy", "brother", "he", "him", "his", "son"]

family_terms = [
    "home",
    "parents",
    "children",
    "family",
    "cousins",
    "marriage",
    "wedding",
    "relatives",
]
career_terms = [
    "executive",
    "management",
    "professional",
    "corporation",
    "salary",
    "office",
    "business",
    "career",
]

# optional, only for wefe usage.
target_sets_names = ["Female terms", "Male terms"]
attribute_sets_names = ["Family terms", "Career terms"]

### WEFE

WEFE defines a standardized framework for executing metrics: in short, you define a query that acts as a container for the words to be tested and then pass it, together with the model, as input to a metric.

The outputs of the metrics are always dictionaries, since most metrics return additional information that may be useful.

In [36]:
# import the modules
from wefe.query import Query

# 1. create the query
query = Query(
    [female_terms, male_terms],
    [family_terms, career_terms],
    target_sets_names,
    attribute_sets_names,
)
query

<Query: Female terms and Male terms wrt Family terms and Career terms
- Target sets: [['female', 'woman', 'girl', 'sister', 'she', 'her', 'hers', 'daughter'], ['male', 'man', 'boy', 'brother', 'he', 'him', 'his', 'son']]
- Attribute sets:[['home', 'parents', 'children', 'family', 'cousins', 'marriage', 'wedding', 'relatives'], ['executive', 'management', 'professional', 'corporation', 'salary', 'office', 'business', 'career']]>

In [39]:
from wefe.metrics.WEAT import WEAT

# 2. instance a WEAT metric and pass the query plus the model.
weat = WEAT()
result = weat.run_query(query, model)
result

{'query_name': 'Female terms and Male terms wrt Family terms and Career terms',
 'result': 0.31658415612764657,
 'weat': 0.31658415612764657,
 'effect_size': 0.677943967611404,
 'p_value': nan}
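
To make the `weat` and `effect_size` entries more concrete, the following is a minimal sketch of how both values are commonly defined, written in plain Python rather than taken from WEFE's source. The toy vectors, the function names and the use of the sample standard deviation are assumptions made for illustration only.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attr_a, attr_b):
    """Mean similarity of w to attribute set A minus mean similarity to B."""
    return mean(cosine(w, a) for a in attr_a) - mean(cosine(w, b) for b in attr_b)


def weat_score(target_x, target_y, attr_a, attr_b):
    """Sum of the associations of X minus the sum of the associations of Y."""
    sum_x = sum(association(x, attr_a, attr_b) for x in target_x)
    sum_y = sum(association(y, attr_a, attr_b) for y in target_y)
    return sum_x - sum_y


def weat_effect_size(target_x, target_y, attr_a, attr_b):
    """Difference of mean associations, scaled by their joint deviation."""
    assoc_x = [association(x, attr_a, attr_b) for x in target_x]
    assoc_y = [association(y, attr_a, attr_b) for y in target_y]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return (mean(assoc_x) - mean(assoc_y)) / stdev(assoc_x + assoc_y)


# Toy 2-d vectors, invented for illustration only.
X = [[1.0, 0.1], [0.9, 0.2]]  # first target set
Y = [[0.1, 1.0], [0.2, 0.9]]  # second target set
A = [[1.0, 0.0]]  # first attribute set
B = [[0.0, 1.0]]  # second attribute set

score = weat_score(X, Y, A, B)
effect = weat_effect_size(X, Y, A, B)
```

In this toy setup, `X` points towards `A` and `Y` towards `B`, so both values come out positive; swapping the two target sets flips the sign of the score.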

Because `run_query` is decoupled from both the query and the model, it can take several parameters that customize how the metric is computed. Here, we show how to normalize the words before looking them up in the model by lowercasing them and removing their accents.

In [40]:
weat = WEAT()
result = weat.run_query(
    query,
    model,
    preprocessors=[{"lowercase": True, "strip_accents": True}],
)
result

{'query_name': 'Female terms and Male terms wrt Family terms and Career terms',
 'result': 0.31658415612764657,
 'weat': 0.31658415612764657,
 'effect_size': 0.677943967611404,
 'p_value': nan}
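
The kind of normalization that the `lowercase` and `strip_accents` options describe can be sketched with the standard library alone. The helper names below are hypothetical, and trying each option dictionary in order until a word is found is an assumption about how a list of preprocessors might be applied, not a description of WEFE's internals.

```python
import unicodedata


def preprocess_word(word, lowercase=False, strip_accents=False):
    """Normalize a word before looking it up in an embedding vocabulary."""
    if lowercase:
        word = word.lower()
    if strip_accents:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", word)
        word = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return word


def lookup(word, vocabulary, preprocessors):
    """Return the first preprocessed variant found in the vocabulary, or None."""
    for options in preprocessors:
        candidate = preprocess_word(word, **options)
        if candidate in vocabulary:
            return candidate
    return None


# Toy vocabulary: the model only knows the unaccented, lowercase form.
vocabulary = {"nina", "cafe"}
found = lookup("Niña", vocabulary, [{}, {"lowercase": True, "strip_accents": True}])
```

Here the unmodified word is tried first (the empty dictionary) and the normalized form is used only as a fallback.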

In this case, we show how to calculate the p-value through a permutation test.

In [41]:
weat = WEAT()
result = weat.run_query(
    query,
    model,
    calculate_p_value=True,
)
result

{'query_name': 'Female terms and Male terms wrt Family terms and Career terms',
 'result': 0.31658415612764657,
 'weat': 0.31658415612764657,
 'effect_size': 0.677943967611404,
 'p_value': 0.08699130086991301}
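
As a rough illustration of what a permutation test does here, the sketch below shuffles per-word association values between the two target sets and counts how often the shuffled statistic exceeds the observed one. It is a simplified, one-sided version under our own assumptions (function name, default number of iterations, add-one correction); it is not WEFE's implementation. The value reported above equals 870/10001, which would be consistent with an add-one correction over 10,000 iterations, but that reading is our inference.

```python
import random


def permutation_p_value(assoc_x, assoc_y, iterations=1000, seed=0):
    """Estimate a one-sided p-value by re-partitioning the target words.

    assoc_x and assoc_y hold one association value per target word
    (for example, mean similarity to set A minus mean similarity to set B).
    """
    rng = random.Random(seed)
    observed = sum(assoc_x) - sum(assoc_y)
    pooled = list(assoc_x) + list(assoc_y)
    n_x = len(assoc_x)
    hits = 0
    for _ in range(iterations):
        rng.shuffle(pooled)
        statistic = sum(pooled[:n_x]) - sum(pooled[n_x:])
        if statistic > observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (hits + 1) / (iterations + 1)


# Toy association values (exactly representable floats), for illustration only.
strongly_separated = permutation_p_value(
    [1.0, 0.75, 0.5, 0.875], [-1.0, -0.75, -0.5, -0.875]
)
reversed_sets = permutation_p_value(
    [-1.0, -0.75, -0.5, -0.875], [1.0, 0.75, 0.5, 0.875]
)
```

When the two target sets are clearly separated, almost no random partition beats the observed statistic and the estimate is small; reversing the sets gives an estimate close to 1.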

This interface makes it possible for us to switch very easily to similar metrics (i.e. supporting the same number of word sets). 

In [42]:
from wefe.metrics import RNSB

rnsb = RNSB()
result = rnsb.run_query(query, model)
result

{'query_name': 'Female terms and Male terms wrt Family terms and Career terms',
 'result': 0.2316374702204208,
 'rnsb': 0.2316374702204208,
 'negative_sentiment_probabilities': {'female': 0.13601304241096868,
  'woman': 0.09092891800011083,
  'girl': 0.018634003460932136,
  'sister': 0.02660626015429457,
  'she': 0.024974915876983528,
  'her': 0.012636440088022338,
  'hers': 0.18909707392930308,
  'daughter': 0.02508881957615572,
  'male': 0.08005167123803347,
  'man': 0.06748459956816788,
  'boy': 0.04567972254971964,
  'brother': 0.05599846481670445,
  'he': 0.06136734334386551,
  'him': 0.03019750326319992,
  'his': 0.05847721064473965,
  'son': 0.08768651868532318},
 'negative_sentiment_distribution': {'female': 0.13454349011675998,
  'woman': 0.08994647692175288,
  'girl': 0.018432672455824813,
  'sister': 0.02631879293823218,
  'she': 0.024705074512698814,
  'her': 0.012499909728927259,
  'hers': 0.18705397545951583,
  'daughter': 0.02481774753987463,
  'male': 0.0791867533225321

In [43]:
from wefe.metrics import MAC

mac = MAC()
result = mac.run_query(query, model)
result

{'query_name': 'Female terms and Male terms wrt Family terms and Career terms',
 'result': 0.4357533518195851,
 'mac': 0.4357533518195851,
 'targets_eval': {'Female terms': {'female': {'Family terms': 0.31804752349853516,
    'Career terms': 0.43660861998796463},
   'woman': {'Family terms': 0.24169018119573593,
    'Career terms': 0.3958834111690521},
   'girl': {'Family terms': 0.27893901616334915,
    'Career terms': 0.5540130585432053},
   'sister': {'Family terms': 0.26786402612924576,
    'Career terms': 0.5402307193726301},
   'she': {'Family terms': 0.3126588687300682,
    'Career terms': 0.5178160294890404},
   'her': {'Family terms': 0.31602276116609573,
    'Career terms': 0.5942907352000475},
   'hers': {'Family terms': 0.4396950639784336,
    'Career terms': 0.5640630088746548},
   'daughter': {'Family terms': 0.2509744167327881,
    'Career terms': 0.5034244172275066}},
  'Male terms': {'male': {'Family terms': 0.3604205995798111,
    'Career terms': 0.5834408048540354},


### Fair Embedding Engine

In the case of Fair Embedding Engine, the embedding model is passed when the metric is instantiated.
Then, the metric value is calculated using the `compute` method of the metric object.

FEE differs somewhat from WEFE's standardization by making each instance of the metric model-dependent.
In addition, it is not clear how to pass a varying number of word sets to the `compute` method: the word sets are passed directly as a star (`*`) parameter, which accepts an arbitrary number of positional arguments.
This lack of definition makes it difficult to understand how many and which word sets to pass.
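
The difference between the two interface styles can be sketched as follows. Both the function and the class are hypothetical and exist only to illustrate the point; neither reproduces FEE's or WEFE's actual code.

```python
from dataclasses import dataclass
from typing import List


def compute_star_args(*word_sets):
    """Star-args style: the signature does not say how many sets are expected."""
    if len(word_sets) != 4:
        raise ValueError("expected 2 target sets followed by 2 attribute sets")
    target_1, target_2, attribute_1, attribute_2 = word_sets
    return {"targets": [target_1, target_2], "attributes": [attribute_1, attribute_2]}


@dataclass
class WordSetQuery:
    """Container style: the role of every word set is explicit."""

    target_sets: List[List[str]]
    attribute_sets: List[List[str]]

    @property
    def template(self):
        """(number of target sets, number of attribute sets)."""
        return (len(self.target_sets), len(self.attribute_sets))


example = WordSetQuery(
    target_sets=[["she", "her"], ["he", "him"]],
    attribute_sets=[["home", "family"], ["office", "career"]],
)
```

With the container, a metric can check `example.template` against the shape it supports before running, whereas the star-args version can only discover a mismatch by counting positional arguments.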

In [44]:
from FEE.fee.metrics import WEAT as FEE_WEAT

fee_weat = FEE_WEAT(fee_model)

fee_weat.compute(female_terms, male_terms, family_terms, career_terms)


FEE's implementation of WEAT also allows the p-value to be calculated.

In [8]:
fee_weat.compute(female_terms, male_terms, family_terms, career_terms, p_val=True)

(0.7264477, 0.0)

Finally, the metric does not support more complex actions such as preprocessing the word sets.

We were not able to find any other metric that could be easily swapped in using the same or a similar interface (as WEFE's standardization layer allows).

### Responsibly

Similar to WEFE, Responsibly has a function that receives the model and the word sets as inputs and returns the WEAT value as output.


In [7]:
from responsibly.we.weat import calc_single_weat

calc_single_weat(
    twitter_25,
    first_target={"name": "female_terms", "words": female_terms},
    second_target={"name": "male_terms", "words": male_terms},
    first_attribute={"name": "family_terms", "words": family_terms},
    second_attribute={"name": "career_terms", "words": career_terms},
)

{'Target words': 'female_terms vs. male_terms',
 'Attrib. words': 'family_terms vs. career_terms',
 's': 0.31658387184143066,
 'd': 0.6779436,
 'p': 0.09673659673659674,
 'Nt': '8x2',
 'Na': '8x2'}

By default, this function also calculates the p-value (the `p` entry above). The same function can be called with `with_pvalue=False` to skip that calculation.

In [9]:
calc_single_weat(
    twitter_25,
    first_target={"name": "female_terms", "words": female_terms},
    second_target={"name": "male_terms", "words": male_terms},
    first_attribute={"name": "family_terms", "words": family_terms},
    second_attribute={"name": "career_terms", "words": career_terms},
    with_pvalue=False,
)

{'Target words': 'female_terms vs. male_terms',
 'Attrib. words': 'family_terms vs. career_terms',
 's': 0.31658387184143066,
 'd': 0.6779436,
 'p': None,
 'Nt': '8x2',
 'Na': '8x2'}

Finally, the metric does not support more complex actions such as preprocessing the word sets.

We were unable to find other metrics directly comparable to those implemented by WEFE.

### EmbeddingBiasScores

EmbeddingBiasScores formalizes how bias is measured in a different way than WEFE: it classifies the methods into clustering or geometric methods (note that WEFE only implements the geometric equivalents).

Under their standardization, each geometric metric must first define the direction of the bias by calling `define_bias_space` on the `attribute_embeddings`, and then use the `group_bias` or `mean_individual_bias` method to calculate the value of the metric.

Examples of usage are shown below:



In [139]:
# the embeddings to be used must be transformed by hand from words to arrays.
target_embeddings = [
    [model[word] for word in female_terms],
    [model[word] for word in male_terms],
]
attribute_embeddings = [
    [model[word] for word in family_terms],
    [model[word] for word in career_terms],
]
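
Since the conversion from words to vectors is done by hand, words that are missing from the vocabulary also have to be handled by hand. Depending on the model object, looking up such a word may raise an error or return an empty value. The helper below is a hypothetical sketch of one way to deal with this; a plain dictionary stands in for the embedding model.

```python
def words_to_vectors(words, embeddings):
    """Map words to vectors and report the words missing from the vocabulary."""
    vectors, missing = [], []
    for word in words:
        if word in embeddings:
            vectors.append(embeddings[word])
        else:
            missing.append(word)
    return vectors, missing


# Toy values; a real model would map each word to a dense vector.
toy_embeddings = {"she": [0.1, 0.9], "he": [0.9, 0.1]}
vectors, missing = words_to_vectors(["she", "he", "hers"], toy_embeddings)
```

Keeping the list of missing words makes it possible to decide afterwards whether too much of a word set was lost for the measurement to be meaningful.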

In [138]:
from EmbeddingBiasScores.geometrical_bias import WEAT

weat = WEAT()
weat.define_bias_space(attribute_embeddings)
# group bias returns the effect size.
weat.group_bias(target_embeddings)

0.6564166411275919

By default, `WEAT` returns the effect size. There is no way to parametrize the metric to calculate the WEAT score or the p-value.

Similar to WEFE, the standardization implemented by EmbeddingBiasScores makes it easy to swap the metric in use for another one that takes the same input word sets.

In [147]:
from EmbeddingBiasScores.geometrical_bias import MAC

mac = MAC()
mac.define_bias_space(attribute_embeddings)

# mac does not accept more than one target set, so we have to calculate it manually.
target_0_mac = mac.mean_individual_bias(target_embeddings[0])
target_1_mac = mac.mean_individual_bias(target_embeddings[1])
(target_0_mac + target_1_mac) / 2

0.4357533518195851
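
The manual averaging above reproduces the pooled value because both target sets have the same number of words. The following plain-Python sketch of the usual MAC definition (mean cosine distance between each target word and each attribute set, averaged over all pairs) illustrates this with toy vectors; the function names and data are assumptions, not code from either library.

```python
import math
from statistics import mean


def cosine_distance(u, v):
    """One minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_distance_to_set(target, attribute_set):
    """Mean cosine distance between one target vector and an attribute set."""
    return mean(cosine_distance(target, a) for a in attribute_set)


def mac(target_set, attribute_sets):
    """Average of the mean distances over all (target word, attribute set) pairs."""
    return mean(
        mean_distance_to_set(t, attr) for t in target_set for attr in attribute_sets
    )


# Toy 2-d vectors, invented for illustration only.
T0 = [[1.0, 0.0], [0.8, 0.6]]
T1 = [[0.0, 1.0], [0.6, 0.8]]
attribute_sets = [[[1.0, 0.0]], [[0.0, 1.0]]]

per_set_average = (mac(T0, attribute_sets) + mac(T1, attribute_sets)) / 2
pooled = mac(T0 + T1, attribute_sets)
```

If the target sets had different sizes, the average of per-set values and the pooled value would generally differ, so the shortcut should be used with care.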

EmbeddingBiasScores includes metrics that WEFE does not implement yet, such as GeneralizedWEAT and SAME.

In [145]:
from EmbeddingBiasScores.geometrical_bias import GeneralizedWEAT

gweat = GeneralizedWEAT()
gweat.define_bias_space(attribute_embeddings)
gweat.group_bias(target_embeddings)

0.019786509

In [146]:
from EmbeddingBiasScores.geometrical_bias import SAME

same = SAME()
same.define_bias_space(attribute_embeddings)
same.mean_individual_bias(target_embeddings[0])

0.31590998622684796

Finally, no metric supports more complex actions such as preprocessing the word sets or customizing execution settings.

### Conclusion

In WEFE, decoupling queries from metrics allows both the parameterization of metric execution and the easy interchange of one metric for another.
Furthermore, the clean and unified interface shared by all metrics makes it intuitive to run bias measurements.

Both Responsibly and FEE have similar interfaces, where the metric arguments are sets of words (and not queries), which makes it difficult to standardize inputs across metrics.
We were unable to find any metric other than WEAT to include in the benchmark for FEE and Responsibly.

On the other hand, EmbeddingBiasScores presents its own mathematical standardization for each metric, as well as some metrics that WEFE does not implement.
While this standardization may be somewhat more specific, it makes the library more complex to use.

The increased difficulty is mainly due to two factors: first, you have to manually define the bias space (using `define_bias_space`); second, you have to work out whether to use `group_bias` or `mean_individual_bias`, which is not clear unless you are familiar with the basics of their standardization.

Finally, the standardization implemented by WEFE allows `run_query` to execute routines that customize the execution of the metrics, such as word preprocessing, embedding normalization and the calculation of submetrics or statistical tests.

# Differences between the IJCAI version and the current version

The most noticeable change between the IJCAI version and the current version is the full implementation of a new debiasing module. It includes five debiasing methods: `HardDebias`, `MulticlassHardDebias`, `DoubleHardDebias`, `RepulsionAttractionNeutralization` and `HalfSiblingRegression`.

Regarding metrics, the original version of WEFE published at IJCAI contained four metrics: `WEAT`, `WEAT-ES`, `RND` and `RNSB`.
Currently, thanks to contributions, WEFE also implements `MAC`, `RIPA` and `ECT`.

Also, the original version contained very rudimentary `Query` and `WordEmbeddingModel` wrapper routines.

In the current version, the wrappers are much more complete and allow better interaction with the user and with WEFE's internal APIs.
Examples include the implementation of `__repr__` for `Query` and `WordEmbeddingModel`, which shows a brief description of each object to the user; the `dict` method in `Query`, which transforms a query into a dictionary; and the `update` method in `WordEmbeddingModel`, which replaces the embedding associated with a word with a new one.
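
To illustrate the kind of convenience these methods provide, here is a small hypothetical container with a `__repr__` and a `dict` method. The class name, attribute names and output format are assumptions made for this sketch; it does not reproduce WEFE's `Query` class.

```python
class MiniQuery:
    """Hypothetical stand-in for a query container; not WEFE's actual class."""

    def __init__(self, target_sets, attribute_sets, target_names, attribute_names):
        self.target_sets = target_sets
        self.attribute_sets = attribute_sets
        self.target_names = target_names
        self.attribute_names = attribute_names

    def __repr__(self):
        # Short, human-readable summary shown when the object is displayed.
        targets = " and ".join(self.target_names)
        attributes = " and ".join(self.attribute_names)
        return f"<MiniQuery: {targets} wrt {attributes}>"

    def dict(self):
        """Return the content of the query as a plain dictionary."""
        return {
            "target_sets": self.target_sets,
            "attribute_sets": self.attribute_sets,
            "target_sets_names": self.target_names,
            "attribute_sets_names": self.attribute_names,
        }


query_sketch = MiniQuery(
    [["she", "her"], ["he", "him"]],
    [["home", "family"], ["office", "career"]],
    ["Female terms", "Male terms"],
    ["Family terms", "Career terms"],
)
summary = repr(query_sketch)
```

A dictionary form like this is convenient for logging a query or storing it alongside the results of a metric.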

The `preprocessing` module was also improved: it now includes much more advanced logic (such as different preprocessing steps), which was modularized and generalized so that any metric or debiasing method can use it.

The documentation has been greatly enhanced compared to the original version by adding new user guides, conceptual guides (where we explain the theoretical framework), multi-language tutorials and detailed API documentation for the metrics and debiasing methods, which also includes theoretical details.
Finally, it is also worth mentioning that both testing and code quality were greatly improved with respect to the original version.