# Benchmark - WEFE, Fair Embedding Engine and Responsibly.AI

To the best of our knowledge, only two other Python libraries besides WEFE implement bias measurement and mitigation methods for word embeddings: Fair Embedding Engine (FEE) and Responsibly.

According to their authors, Fair Embedding Engine is "A Library for Analyzing and Mitigating Gender Bias in Word Embeddings", while Responsibly is a "Toolkit for Auditing and Mitigating Bias and Fairness of Machine Learning Systems."

The FEE and Responsibly documentation can be found at the following links respectively: 
- https://github.com/FEE-Fair-Embedding-Engine/FEE
- https://docs.responsibly.ai/

This document compares these two libraries with WEFE in several areas.

The points to be evaluated are:

1. Ease of installation
2. Quality of the package and documentation
3. Ease of loading models
4. Ease of running bias measurements
5. Performance in execution times


## 0. Metrics Comparison

## 1. Ease of installation

This comparison aims to evaluate how easy it is to install the library.

### WEFE

According to the documentation, WEFE is available for installation using the Python Package Index (via pip) as well as via conda.


```bash
pip install --upgrade wefe
# or
conda install -c pbadilla wefe
```

### Fair Embedding Engine

In the case of FEE, neither the documentation nor the repository indicates how to install the package. Therefore, the easiest thing to do in this case is to clone the repository and then install the requirements manually.

1. Clone the repo
```bash
$ git clone https://github.com/FEE-Fair-Embedding-Engine/FEE
```

2. Install the requirements.
```bash
$ pip install -r FEE/requirements.txt
$ pip install sympy
$ pip install -U gensim==3.8.3
```

### Responsibly

According to its documentation, Responsibly is also hosted on the Python Package Index, so it can be installed using pip.

```bash
$ pip install responsibly
```

### Conclusion

Both WEFE and Responsibly are easy to install, which lowers the initial barrier to entry. FEE, on the other hand, requires more knowledge of Python before it can be used.
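
As a quick sanity check after installation, the following minimal sketch reports which packages can be imported in the current environment. The helper `installed_packages` is hypothetical (it is not part of any of the three libraries), and `wefe` and `responsibly` are assumed to be the top-level import names of the pip packages.

```python
from importlib.util import find_spec


def installed_packages(names):
    """Map each top-level module name to True if it can be imported here."""
    return {name: find_spec(name) is not None for name in names}


# Assumed import names; FEE is not pip-installable, so it is only importable
# from the directory where the repository was cloned.
status = installed_packages(["wefe", "responsibly"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

A missing entry simply means the corresponding `pip install` step has not been run in the active environment.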

## 2. Quality of the package and documentation

This benchmark seeks to compare the quality of the documentation as well as the quality and best practices of the code.

### WEFE

WEFE has a complete documentation page that explains the use of the package in detail: an about section with the motivation and objectives of the project, a quick start showing how to install the library, multiple user guides for measuring and mitigating bias, a detailed API reference for the implemented methods, theoretical manuals, and replications of previous case studies.

In addition, most of the code is tested and was developed using continuous integration mechanisms (a linter and testing in GitHub Actions) that help maintain good code quality.


### Fair Embedding Engine

FEE's documentation covers only the basic aspects of the API, plus a flowchart showing the main concepts of the library.
The documentation does not contain user guides, code examples or theoretical information about the implemented methods.

On the other hand, no tests, linters or continuous code integration mechanisms could be identified.

### Responsibly

Responsibly also has a complete documentation page that explains the use of the package: an index with the main project information, a quick start showing how to install the library, demos that act as user guides, and a detailed API reference for the implemented methods.

In addition, most of the code is tested and was developed using continuous integration mechanisms (a linter and testing in GitHub Actions) that help maintain good code quality.


### Conclusion

In terms of documentation, WEFE is much more detailed than the other libraries, with more extensive manuals and replications of previous case studies.
Responsibly has sufficient documentation to run its main functionalities without major problems, although it is not exhaustive.
FEE, on the other hand, only has API documentation, which is insufficient for new users to start using it directly.

With respect to software quality, both WEFE and Responsibly comply with best practices. 
FEE contains neither testing nor mechanisms to control code quality.

## 3. Ease of loading models

This comparison looks at how easy it is to load a word embedding model.
In this benchmark, two tests are compared: loading a model from the gensim API (`glove-twitter-25`) and loading a model from a binary file (`word2vec`).

The second test requires the original word2vec model, which can be downloaded with the following code:

In [None]:
!wget https://github.com/RaRe-Technologies/gensim-data/releases/download/word2vec-google-news-300/word2vec-google-news-300.gz
!gzip -dv word2vec-google-news-300.gz

### WEFE

In WEFE, models are simply wrappers of Gensim models. This implies that the model reading process (either loaded by the API or by a file) is handled by the Gensim loaders, while the class that generates the objects that allow access to the embeddings is managed by WEFE.

The following code can be used for the loading of the glove model from the gensim API:

In [5]:
from wefe.word_embedding_model import WordEmbeddingModel
import gensim.downloader as api

# load glove
twitter_25 = api.load('glove-twitter-25')
model = WordEmbeddingModel(twitter_25, 'glove twitter dim=25')

The following code allows you to load word2vec from its original file.

In [None]:
from wefe.word_embedding_model import WordEmbeddingModel
from gensim.models.keyedvectors import KeyedVectors

# load word2vec
word2vec = KeyedVectors.load_word2vec_format('word2vec-google-news-300', binary=True)
model = WordEmbeddingModel(word2vec, 'word2vec-google-news-300')
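
To make the idea of a thin wrapper concrete, here is a minimal sketch that does not use WEFE or gensim. The class `EmbeddingWrapper` and the toy vectors are hypothetical; they only illustrate the separation described above, where loading happens elsewhere and the wrapper just stores the already-loaded model and exposes access to its embeddings.

```python
class EmbeddingWrapper:
    """Hypothetical thin wrapper: stores an already-loaded model and a name."""

    def __init__(self, wv, name):
        self.wv = wv      # any mapping from word to vector (stand-in for a gensim model)
        self.name = name  # human-readable model name

    def __contains__(self, word):
        return word in self.wv

    def __getitem__(self, word):
        return self.wv[word]


# Toy 2-dimensional "embeddings" used only for illustration.
toy_vectors = {"she": [0.1, 0.9], "he": [0.9, 0.1]}
wrapped = EmbeddingWrapper(toy_vectors, "toy model")

print(wrapped.name, "she" in wrapped, wrapped["he"])
```

Because the wrapper never reads files itself, any loader that produces a compatible object can be combined with it.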

### Fair Embedding Engine

FEE also offers direct support for loading models from the gensim API through the following code.
In this case, model loading is coupled to the class that provides the methods for accessing the embeddings.

In [1]:
from FEE.fee.embedding.loader import WE

fee_model = WE().load(ename = 'glove-twitter-25')

In [1]:
from FEE.fee.embedding.loader import WE

fee_model = WE().load(fname = 'word2vec-google-news-300', format='bin')


### Responsibly

Responsibly has no intermediate interface for handling embedding models; it simply uses the gensim interface for this purpose, as the following script shows.


In [None]:
import gensim.downloader as api
from gensim.models.keyedvectors import KeyedVectors

twitter_25 = api.load('glove-twitter-25')
word2vec = KeyedVectors.load_word2vec_format('word2vec-google-news-300', binary=True)

### Conclusion

In this case, the three libraries show similar behaviors and capabilities, which does not allow us to distinguish significant differences between them.

## 4. Ease of running bias measurements

This benchmark is intended to show how easy it is to run queries on the metrics that can be used. 
To keep the comparison simple, the set of words and the embeddings model will be kept fixed; only the metrics executed will be varied.

In [4]:
# words to evaluate

female_terms = ["female", "woman", "girl", "sister", "she", "her", "hers", "daughter"]
male_terms = ["male", "man", "boy", "brother", "he", "him", "his", "son"]

family_terms = [
    "home",
    "parents",
    "children",
    "family",
    "cousins",
    "marriage",
    "wedding",
    "relatives",
]
career_terms = [
    "executive",
    "management",
    "professional",
    "corporation",
    "salary",
    "office",
    "business",
    "career",
]

# optional, only used by WEFE.
# note: these are display labels only; the attribute sets above hold family and career terms.
target_sets_names = ["Female Terms", "Male Terms"]
attribute_sets_names = ["Arts", "Science"]


### WEFE

WEFE defines a standardized framework for running metrics: a query acts as a container for the word sets to be tested and is then passed, together with the model, as input to a metric.

The outputs of the metrics are always dictionaries since most of them contain additional information that could eventually be useful.

In [6]:
# import the modules
from wefe.query import Query

# 1. create the query
query = Query(
    [female_terms, male_terms],
    [family_terms, career_terms],
    target_sets_names,
    attribute_sets_names,
)
query

<Query: Female Terms and Male Terms wrt Arts and Science
- Target sets: [['female', 'woman', 'girl', 'sister', 'she', 'her', 'hers', 'daughter'], ['male', 'man', 'boy', 'brother', 'he', 'him', 'his', 'son']]
- Attribute sets:[['home', 'parents', 'children', 'family', 'cousins', 'marriage', 'wedding', 'relatives'], ['executive', 'management', 'professional', 'corporation', 'salary', 'office', 'business', 'career']]>

In [7]:
from wefe.metrics.WEAT import WEAT

# 2. instantiate a WEAT metric and pass it the query plus the model.
weat = WEAT()
result = weat.run_query(query, model)
result

{'query_name': 'Female Terms and Male Terms wrt Arts and Science',
 'result': 0.31658412935212255,
 'weat': 0.31658412935212255,
 'effect_size': 0.6779439085309583,
 'p_value': nan}
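
For readers unfamiliar with WEAT, the following pure-Python sketch shows the kind of computation behind the `weat` and `effect_size` fields. It follows the commonly cited definition (difference of summed associations, and a mean difference divided by the sample standard deviation); it is not WEFE's code, the function names are hypothetical, and libraries may normalize differently (for example, means instead of sums), so the numbers are not expected to match the output above.

```python
import math
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(w, attrs_a, attrs_b):
    """s(w, A, B): mean similarity to A minus mean similarity to B."""
    return mean(cosine(w, a) for a in attrs_a) - mean(cosine(w, b) for b in attrs_b)


def weat_score(targets_x, targets_y, attrs_a, attrs_b):
    """Sum of associations of X minus sum of associations of Y."""
    sx = [association(x, attrs_a, attrs_b) for x in targets_x]
    sy = [association(y, attrs_a, attrs_b) for y in targets_y]
    return sum(sx) - sum(sy)


def weat_effect_size(targets_x, targets_y, attrs_a, attrs_b):
    """Mean difference normalized by the sample std over all target words."""
    sx = [association(x, attrs_a, attrs_b) for x in targets_x]
    sy = [association(y, attrs_a, attrs_b) for y in targets_y]
    return (mean(sx) - mean(sy)) / stdev(sx + sy)


# Toy 2-dimensional vectors, for illustration only.
X = [(1.0, 0.0), (0.9, 0.1)]
Y = [(0.0, 1.0), (0.1, 0.9)]
A = [(1.0, 0.0)]
B = [(0.0, 1.0)]

print(weat_score(X, Y, A, B), weat_effect_size(X, Y, A, B))
```

A positive score here means the X targets sit closer to attribute set A than the Y targets do.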

Because `run_query` is decoupled from the query and the model, it can take additional parameters that customize the behavior of the metric. In this case, we show how to normalize the words before looking them up in the model by lowercasing them and then removing their accents.

In [14]:
weat = WEAT()
result = weat.run_query(
    query,
    model,
    preprocessors=[{"lowercase": True, "strip_accents": True}],
)
result


{'query_name': 'Female Terms and Male Terms wrt Arts and Science',
 'result': 0.31658412935212255,
 'weat': 0.31658412935212255,
 'effect_size': 0.6779439085309583,
 'p_value': nan}
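
The effect of such a preprocessing step can be approximated with the standard library. The sketch below is an assumption about what lowercasing and accent stripping do to a single word; the function `preprocess_word` is hypothetical, and WEFE's actual implementation may differ in detail.

```python
import unicodedata


def preprocess_word(word, lowercase=True, strip_accents=True):
    """Lowercase a word and/or remove its combining accent marks."""
    if lowercase:
        word = word.lower()
    if strip_accents:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", word)
        word = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return word


print(preprocess_word("Café"))  # prints "cafe"
```

Normalizing words this way increases the chance of finding them in models whose vocabulary is lowercase and unaccented.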

In this case, we show how to calculate the p-value through a permutation test.

In [15]:
weat = WEAT()
result = weat.run_query(
    query,
    model,
    calculate_p_value=True,
)
result


{'query_name': 'Female Terms and Male Terms wrt Arts and Science',
 'result': 0.31658412935212255,
 'weat': 0.31658412935212255,
 'effect_size': 0.6779439085309583,
 'p_value': 0.0888911108889111}
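
A permutation test of this kind can be sketched as follows. This is a simplified, hypothetical version that works on precomputed per-word association scores and samples random re-partitions of the pooled target words; it is not the code used by WEFE or FEE, and the names `weat_statistic` and `permutation_p_value` are illustrative.

```python
import random


def weat_statistic(scores_x, scores_y):
    """Difference between the summed association scores of the two groups."""
    return sum(scores_x) - sum(scores_y)


def permutation_p_value(scores_x, scores_y, iterations=10_000, seed=0):
    """One-sided p-value: share of random splits with a larger statistic."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = weat_statistic(scores_x, scores_y)
    pooled = list(scores_x) + list(scores_y)
    n_x = len(scores_x)
    exceed = 0
    for _ in range(iterations):
        rng.shuffle(pooled)
        if weat_statistic(pooled[:n_x], pooled[n_x:]) > observed:
            exceed += 1
    return exceed / iterations


# Toy association scores: X is clearly more associated than Y.
p = permutation_p_value([1.0, 0.5, 0.25], [-1.0, -0.5, -0.25], iterations=500)
print(p)
```

With only a few words per set the number of distinct partitions is small, so estimated p-values are coarse and can differ noticeably between implementations.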

This interface makes it possible for us to switch very easily to similar metrics (i.e. supporting the same number of word sets). 

In [8]:
from wefe.metrics import RNSB

rnsb = RNSB()
result = rnsb.run_query(query, model)
result

{'query_name': 'Female Terms and Male Terms wrt Arts and Science',
 'result': 0.216306751034807,
 'rnsb': 0.216306751034807,
 'negative_sentiment_probabilities': {'female': 0.16418179646736974,
  'woman': 0.10820153944001121,
  'girl': 0.020617838110230324,
  'sister': 0.027141541058957608,
  'she': 0.02994207142697125,
  'her': 0.01769030389841153,
  'hers': 0.19920453536673288,
  'daughter': 0.030796399892449533,
  'male': 0.06336197706763091,
  'man': 0.0960030254620704,
  'boy': 0.056068043904550446,
  'brother': 0.0670556581455255,
  'he': 0.11074181520707649,
  'him': 0.04994848590137757,
  'his': 0.11684191035273095,
  'son': 0.1483104749013432},
 'negative_sentiment_distribution': {'female': 0.1257031346581953,
  'woman': 0.08284275708455253,
  'girl': 0.015785713983500267,
  'sister': 0.020780481539213497,
  'she': 0.02292466227994957,
  'her': 0.013544294805717852,
  'hers': 0.15251772774154254,
  'daughter': 0.023578765039506653,
  'male': 0.04851207202574892,
  'man': 0.073

In [11]:
from wefe.metrics import MAC

mac = MAC()
result = mac.run_query(query, model)
result

{'query_name': 'Female Terms and Male Terms wrt Arts and Science',
 'result': 0.4357533518195851,
 'mac': 0.4357533518195851,
 'targets_eval': {'Female Terms': {'female': {'Arts': 0.31804752349853516,
    'Science': 0.43660861998796463},
   'woman': {'Arts': 0.24169018119573593, 'Science': 0.3958834111690521},
   'girl': {'Arts': 0.27893901616334915, 'Science': 0.5540130585432053},
   'sister': {'Arts': 0.26786402612924576, 'Science': 0.5402307193726301},
   'she': {'Arts': 0.3126588687300682, 'Science': 0.5178160294890404},
   'her': {'Arts': 0.31602276116609573, 'Science': 0.5942907352000475},
   'hers': {'Arts': 0.4396950639784336, 'Science': 0.5640630088746548},
   'daughter': {'Arts': 0.2509744167327881, 'Science': 0.5034244172275066}},
  'Male Terms': {'male': {'Arts': 0.3604205995798111,
    'Science': 0.5834408048540354},
   'man': {'Arts': 0.3802716135978699, 'Science': 0.5215189531445503},
   'boy': {'Arts': 0.32475025951862335, 'Science': 0.5474967788904905},
   'brother': {

### Fair Embedding Engine

In Fair Embedding Engine, the embedding model is passed when the metric is instantiated, and the metric value is then calculated through the `compute` method.

In this respect, FEE differs from WEFE's standardized interface, since each metric instance is tied to a specific model.

In addition, the word sets are passed directly as star (`*`) parameters, that is, an arbitrary number of positional arguments, which makes it difficult to know how many word sets to pass and in which order.

In [5]:
from FEE.fee.metrics import WEAT as FEE_WEAT
fee_weat = FEE_WEAT(fee_model)

fee_weat.compute(female_terms, male_terms, family_terms, career_terms)

0.7264477
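
The readability issue with star parameters can be illustrated with a small sketch. Both functions and the `WordSetQuery` class below are hypothetical and belong to neither FEE nor WEFE; they only contrast a positional signature with an explicit container.

```python
from dataclasses import dataclass
from typing import List


def compute_positional(*word_sets):
    """Star-args style: the caller must remember the expected count and order."""
    if len(word_sets) != 4:
        raise ValueError("expected exactly 4 word sets: two targets, two attributes")
    targets, attributes = word_sets[:2], word_sets[2:]
    return len(targets), len(attributes)


@dataclass
class WordSetQuery:
    """Container style: the role of each word set is explicit."""
    target_sets: List[List[str]]
    attribute_sets: List[List[str]]


def compute_from_query(query):
    return len(query.target_sets), len(query.attribute_sets)


print(compute_positional(["she"], ["he"], ["home"], ["office"]))
print(compute_from_query(WordSetQuery([["she"], ["he"]], [["home"], ["office"]])))
```

With the container, swapping targets and attributes by mistake is visible at the call site, whereas the positional version only fails (if at all) at run time.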

FEE's implementation of WEAT also allows the p-value to be calculated.

In [8]:
fee_weat.compute(female_terms, male_terms, family_terms, career_terms, p_val=True)

(0.7264477, 0.0)

However, it does not support more complex options such as preprocessing the word sets.

Finally, we were not able to find any other metric that could be easily swapped in through the same interface (unlike WEFE and its standardization layer).

### Responsibly