# WEFE Rankings and Correlations

This notebook shows how to run a large batch of bias tests across several bias criteria (gender, ethnicity and religion), embedding models and metrics. From those results, we build one ranking per bias criterion and plot the correlations between the rankings.

The goal is to check whether the different rankings agree, that is, whether we can say with some confidence that certain embeddings are less biased than others.

The steps are:

1. We create the queries. These are separated by type of bias explored: gender, ethnicity and religion.

2. We list the embeddings to be loaded later. For now, only public models available through the gensim API are supported (this will be updated later).

3. We create some runners, which are wrappers that make the queries compatible with each metric and the results more robust.

4. We execute the queries on all the embeddings using all the metrics.

5. We create a ranking of the results for each evaluated bias criterion, plus an overall ranking that contains the sum of the previous rankings.

6. We graph the rankings.

7. We calculate and graph the correlations of the rankings.



In general, this code takes about an hour to run.


In [2]:
%load_ext autoreload
%autoreload 2

In [88]:
import pandas as pd
import numpy as np
from functools import reduce
import gensim.downloader as api
import os

from wefe.datasets import load_weat, fetch_eds, fetch_debias_multiclass, fetch_debiaswe, load_bingliu
from wefe.query import Query
from wefe.word_embedding_model import WordEmbeddingModel
from wefe.metrics import WEAT, RNSB, RND


from wefe.utils import run_queries, plot_queries_results, create_ranking, plot_ranking, calculate_ranking_correlations, plot_ranking_correlations
from plotly.subplots import make_subplots

## Queries

### Load the word sets

In [4]:
WEAT_wordsets = load_weat()
RND_wordsets = fetch_eds()
sentiments_wordsets = load_bingliu()
debias_multiclass_wordsets = fetch_debias_multiclass()

Fetching file...
patool: Extracting ./lexicon.rar ...
patool: running C:\ProgramData\chocolatey\bin\7z.EXE x -o./ -- ./lexicon.rar
patool: ... ./lexicon.rar extracted to `./'.


### Create the ethnicity Queries


In [5]:

eth_1 = Query([RND_wordsets['names_white'], RND_wordsets['names_black']],
              [WEAT_wordsets['pleasant_5'], WEAT_wordsets['unpleasant_5']],
              ['White last names', 'Black last names'],
              ['Pleasant', 'Unpleasant'])

eth_2 = Query([RND_wordsets['names_white'], RND_wordsets['names_asian']],
              [WEAT_wordsets['pleasant_5'], WEAT_wordsets['unpleasant_5']],
              ['White last names', 'Asian last names'],
              ['Pleasant', 'Unpleasant'])

eth_3 = Query([RND_wordsets['names_white'], RND_wordsets['names_hispanic']],
              [WEAT_wordsets['pleasant_5'], WEAT_wordsets['unpleasant_5']],
              ['White last names', 'Hispanic last names'],
              ['Pleasant', 'Unpleasant'])

eth_4 = Query(
    [RND_wordsets['names_white'], RND_wordsets['names_black']],
    [RND_wordsets['occupations_white'], RND_wordsets['occupations_black']],
    ['White last names', 'Black last names'],
    ['Occupations white', 'Occupations black'])

eth_5 = Query(
    [RND_wordsets['names_white'], RND_wordsets['names_asian']],
    [RND_wordsets['occupations_white'], RND_wordsets['occupations_asian']],
    ['White last names', 'Asian last names'],
    ['Occupations white', 'Occupations asian'])

eth_6 = Query(
    [RND_wordsets['names_white'], RND_wordsets['names_hispanic']],
    [RND_wordsets['occupations_white'], RND_wordsets['occupations_hispanic']],
    ['White last names', 'Hispanic last names'],
    ['Occupations white', 'Occupations hispanic'])

eth_sent_1 = Query([RND_wordsets['names_white'], RND_wordsets['names_black']],
                   [
                       sentiments_wordsets['positive_words'],
                       sentiments_wordsets['negative_words']
                   ], ['White last names', 'Black last names'],
                   ['Positive words', 'Negative words'])

eth_sent_2 = Query([RND_wordsets['names_white'], RND_wordsets['names_asian']],
                   [
                       sentiments_wordsets['positive_words'],
                       sentiments_wordsets['negative_words']
                   ], ['White last names', 'Asian last names'],
                   ['Positive words', 'Negative words'])

eth_sent_3 = Query(
    [RND_wordsets['names_white'], RND_wordsets['names_hispanic']], [
        sentiments_wordsets['positive_words'],
        sentiments_wordsets['negative_words']
    ], ['White last names', 'Hispanic last names'],
    ['Positive words', 'Negative words'])

ethnicity_queries = [
    eth_1, eth_2, eth_3, eth_4, eth_5, eth_6, eth_sent_1, eth_sent_2,
    eth_sent_3
]

### Create gender queries

In [6]:
gender_1 = Query([RND_wordsets['male_terms'], RND_wordsets['female_terms']],
                 [WEAT_wordsets['career'], WEAT_wordsets['family']],
                 ['Male terms', 'Female terms'], ['Career', 'Family'])

gender_2 = Query([RND_wordsets['male_terms'], RND_wordsets['female_terms']],
                 [WEAT_wordsets['math'], WEAT_wordsets['arts']],
                 ['Male terms', 'Female terms'], ['Math', 'Arts'])

gender_3 = Query([RND_wordsets['male_terms'], RND_wordsets['female_terms']],
                 [WEAT_wordsets['science'], WEAT_wordsets['arts_2']],
                 ['Male terms', 'Female terms'], ['Science', 'Arts'])

gender_4 = Query([RND_wordsets['male_terms'], RND_wordsets['female_terms']], [
    RND_wordsets['adjectives_intelligence'],
    RND_wordsets['adjectives_appearance']
], ['Male terms', 'Female terms'], ['Intelligence', 'Appearance'])

gender_5 = Query([RND_wordsets['male_terms'], RND_wordsets['female_terms']], [
    RND_wordsets['adjectives_intelligence'],
    RND_wordsets['adjectives_sensitive']
], ['Male terms', 'Female terms'], ['Intelligence', 'Sensitive'])

gender_6 = Query([RND_wordsets['male_terms'], RND_wordsets['female_terms']],
                 [WEAT_wordsets['pleasant_5'], WEAT_wordsets['unpleasant_5']],
                 ['Male terms', 'Female terms'], ['Pleasant', 'Unpleasant'])

gender_sent_1 = Query(
    [RND_wordsets['male_terms'], RND_wordsets['female_terms']], [
        sentiments_wordsets['positive_words'],
        sentiments_wordsets['negative_words']
    ], ['Male terms', 'Female terms'], ['Positive words', 'Negative words'])

gender_role_1 = Query(
    [RND_wordsets['male_terms'], RND_wordsets['female_terms']], [
        debias_multiclass_wordsets['male_roles'],
        debias_multiclass_wordsets['female_roles']
    ], ['Male terms', 'Female terms'], ['Man Roles', 'Woman Roles'])

gender_queries = [
    gender_1, gender_2, gender_3, gender_4, gender_5, gender_sent_1,
    gender_role_1
]

### Create religion queries

In [7]:
rel_1 = Query([
    debias_multiclass_wordsets['christianity_terms'],
    debias_multiclass_wordsets['islam_terms']
], [WEAT_wordsets['pleasant_5'], WEAT_wordsets['unpleasant_5']],
              ['Christianity terms', 'Islam terms'],
              ['Pleasant', 'Unpleasant'])

rel_2 = Query([
    debias_multiclass_wordsets['christianity_terms'],
    debias_multiclass_wordsets['judaism_terms']
], [WEAT_wordsets['pleasant_5'], WEAT_wordsets['unpleasant_5']],
              ['Christianity terms', 'Judaism terms'],
              ['Pleasant', 'Unpleasant'])

rel_3 = Query([
    debias_multiclass_wordsets['islam_terms'],
    debias_multiclass_wordsets['judaism_terms']
], [WEAT_wordsets['pleasant_5'], WEAT_wordsets['unpleasant_5']],
              ['Islam terms', 'Judaism terms'], ['Pleasant', 'Unpleasant'])

rel_4 = Query([
    debias_multiclass_wordsets['christianity_terms'],
    debias_multiclass_wordsets['islam_terms']
], [
    debias_multiclass_wordsets['christian_related_words'],
    debias_multiclass_wordsets['muslim_related_words']
], ['Christianity terms', 'Islam terms'],
              ['Christian related words', 'Muslim related words'])

rel_5 = Query([
    debias_multiclass_wordsets['christianity_terms'],
    debias_multiclass_wordsets['judaism_terms']
], [
    debias_multiclass_wordsets['christian_related_words'],
    debias_multiclass_wordsets['jew_related_words']
], ['Christianity terms', 'Jew terms'],
              ['Christian related words', 'Jew related words'])

rel_6 = Query([
    debias_multiclass_wordsets['islam_terms'],
    debias_multiclass_wordsets['judaism_terms']
], [
    debias_multiclass_wordsets['muslim_related_words'],
    debias_multiclass_wordsets['jew_related_words']
], ['Islam terms', 'Jew terms'], ['Muslim related words', 'Jew related words'])

rel_sent_1 = Query([
    debias_multiclass_wordsets['christianity_terms'],
    debias_multiclass_wordsets['islam_terms']
], [
    sentiments_wordsets['positive_words'],
    sentiments_wordsets['negative_words']
], ['Christianity terms', 'Islam terms'], ['Positive words', 'Negative words'])

rel_sent_2 = Query([
    debias_multiclass_wordsets['christianity_terms'],
    debias_multiclass_wordsets['judaism_terms']
], [
    sentiments_wordsets['positive_words'],
    sentiments_wordsets['negative_words']
], ['Christianity terms', 'Jew terms'], ['Positive words', 'Negative words'])

rel_sent_3 = Query([
    debias_multiclass_wordsets['islam_terms'],
    debias_multiclass_wordsets['judaism_terms']
], [
    sentiments_wordsets['positive_words'],
    sentiments_wordsets['negative_words']
], ['Islam terms', 'Jew terms'], ['Positive words', 'Negative words'])

religion_queries = [
    rel_1, rel_2, rel_3, rel_4, rel_5, rel_6, rel_sent_1, rel_sent_2,
    rel_sent_3
]

In [8]:
queries_sets = [[gender_queries, 'Gender'],
                [ethnicity_queries, 'Ethnicity'],
                [religion_queries, 'Religion']]

## Models

### Set the models list

In [9]:
models = [
    {
        'name': 'glove-twitter-200',
        'type': 'gensim'
    },
    {
        'name': 'glove-twitter-100',
        'type': 'gensim'
    },
    {
        'name': 'glove-wiki-gigaword-100',
        'type': 'gensim'
    },
    {
        'name': 'glove-wiki-gigaword-200',
        'type': 'gensim'
    },
    {
        'name': 'glove-wiki-gigaword-300',
        'type': 'gensim'
    },
    {
        'name': 'word2vec-google-news-300',
        'type': 'gensim'
    },
    {
        'name': 'fasttext-wiki-news-subwords-300',
        'type': 'gensim'
    },
    {
        'name': 'conceptnet-numberbatch-17-06-300',
        'type': 'gensim',
        'prefix' : '/c/en/'
    },
]

## Metrics

### Metrics wrappers
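
One of the wrappers in the next cell splits each (2, 2) query into (2, 1) subqueries and drops repeated ones. The sketch below illustrates that idea with a hypothetical `ToyQuery` class that only stores set names; it is not WEFE's `Query`, and the name format is an assumption made for the example. The set names are taken from the gender queries defined earlier, where 'Arts' and 'Intelligence' each appear in two queries.

```python
from itertools import combinations
from typing import NamedTuple


class ToyQuery(NamedTuple):
    """Hypothetical stand-in for a query: only the set names are kept."""
    target_names: tuple
    attribute_names: tuple

    @property
    def name(self):
        # Assumed name format, used only to detect duplicates
        return "{} wrt {}".format(" and ".join(self.target_names),
                                  " and ".join(self.attribute_names))


def split_attributes(query, n_attributes):
    """Return every subquery that keeps all targets and n_attributes attributes."""
    return [
        ToyQuery(query.target_names, attrs)
        for attrs in combinations(query.attribute_names, n_attributes)
    ]


def unique_subqueries(queries, n_attributes=1):
    """Split each query and keep the first subquery seen for each name."""
    seen = {}
    for query in queries:
        for subquery in split_attributes(query, n_attributes):
            seen.setdefault(subquery.name, subquery)
    return list(seen.values())


targets = ("Male terms", "Female terms")
toy_queries = [
    ToyQuery(targets, ("Math", "Arts")),
    ToyQuery(targets, ("Science", "Arts")),
    ToyQuery(targets, ("Intelligence", "Appearance")),
    ToyQuery(targets, ("Intelligence", "Sensitive")),
]

# 4 queries produce 8 subqueries, 2 of which are repeated
subqueries = unique_subqueries(toy_queries)
for subquery in subqueries:
    print(subquery.name)
```

Using a dictionary keyed by name keeps the first occurrence of each subquery and avoids the nested loop of a pairwise comparison.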

In [10]:
RNSB_NUM_ITERS = 30


def evaluate_WEAT(queries_set, models_arr):
    return run_queries(WEAT,
                       queries_set[0],
                       models_arr,
                       queries_set_name=queries_set[1],
                       include_average_by_embedding='include',
                       warn_filtered_words=False)

# this wrapper sets WEAT to return the effect size
def evaluate_WEAT_effect_size(queries_set, models_arr):
    return run_queries(WEAT,
                       queries_set[0],
                       models_arr,
                       queries_set_name=queries_set[1],
                       metric_params={'return_effect_size': True},
                       include_average_by_embedding='include',
                       warn_filtered_words=False)

# This wrapper converts the default (2, 2) query template
# into the (2, 1) template that RND requires.
# To do this, it uses the get_subqueries method and then
# discards any duplicated subqueries.
def evaluate_RND(queries_set, models_arr):

    subqueries_arr = []
    for query in queries_set[0]:
        subqueries = query.get_subqueries((2, 1))
        # simple (not very fast) duplicate check
        subq_0_duplicated = False
        subq_1_duplicated = False
        for subq in subqueries_arr:
            if subqueries[0].query_name_ == subq.query_name_:
                subq_0_duplicated = True
            if subqueries[1].query_name_ == subq.query_name_:
                subq_1_duplicated = True
        if not subq_0_duplicated:
            subqueries_arr.append(subqueries[0])
        if not subq_1_duplicated:
            subqueries_arr.append(subqueries[1])

    return run_queries(RND,
                       subqueries_arr,
                       models_arr,
                       queries_set_name=queries_set[1],
                       include_average_by_embedding='include',
                       warn_filtered_words=False)


# This wrapper evaluates RNSB several times and averages the scores.
# Since the results vary from run to run, this gives more robust results.
def evaluate_RNSB(queries_set, models_arr):

    RNSB_scores_iter = []
    omitted = 0

    # Run the metric several times and average the scores
    # to reduce the effect of outliers.
    for i in range(RNSB_NUM_ITERS):
        try:
            RNSB_scores_iter.append(
                run_queries(RNSB,
                            queries_set[0],
                            models_arr,
                            queries_set_name=queries_set[1],
                            include_average_by_embedding='include',
                            warn_filtered_words=False))
        except Exception:
            omitted += 1
    if omitted != 0:
        print('\tIterations omitted: {}'.format(omitted))
    # Average over the runs that actually succeeded, so that
    # omitted iterations do not pull the average towards zero.
    RNSB_scores = reduce(
        (lambda x, y: x + y), RNSB_scores_iter) / len(RNSB_scores_iter)

    return RNSB_scores


runners = [[evaluate_WEAT, 'WEAT'], [evaluate_WEAT_effect_size, 'WEAT_EZ'],
           [evaluate_RND, 'RND'], [evaluate_RNSB, 'RNSB']]
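
The averaging idea behind the RNSB wrapper can be illustrated in isolation. This is a minimal sketch with plain numbers instead of result DataFrames; `average_over_runs` and `flaky_metric` are hypothetical names introduced only for the example. It averages over the runs that succeed, so skipped iterations do not distort the mean.

```python
def average_over_runs(run_once, num_iters):
    """Average the results of run_once(i), skipping iterations that raise."""
    results = []
    omitted = 0
    for i in range(num_iters):
        try:
            results.append(run_once(i))
        except ValueError:
            omitted += 1
    if not results:
        raise RuntimeError("every iteration failed")
    # Divide by the number of successful runs, not by num_iters
    return sum(results) / len(results), omitted


def flaky_metric(i):
    """Hypothetical metric: fails on iteration 2, otherwise returns a score."""
    if i == 2:
        raise ValueError("simulated failure")
    return 0.5 + 0.25 * (i % 2)


mean_score, omitted = average_over_runs(flaky_metric, num_iters=5)
print(mean_score, omitted)
```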

## Run the experiments


The following code runs the experiments, varying three variables:

- metrics = WEAT, WEAT effect size, RND, RNSB

- queries = Gender, Ethnicity and Religion.

- embeddings = all specified before.
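
As a rough illustration of the size of this grid, the sketch below enumerates every (model, metric, criterion) combination with `itertools.product`, in the same nesting order as the loops in the next cell. The names are copied from the lists defined earlier; the snippet only counts combinations and does not load any model.

```python
from itertools import product

metric_names = ["WEAT", "WEAT_EZ", "RND", "RNSB"]
criteria = ["Gender", "Ethnicity", "Religion"]
model_names = [
    "glove-twitter-200",
    "glove-twitter-100",
    "glove-wiki-gigaword-100",
    "glove-wiki-gigaword-200",
    "glove-wiki-gigaword-300",
    "word2vec-google-news-300",
    "fasttext-wiki-news-subwords-300",
    "conceptnet-numberbatch-17-06-300",
]

# One run per model, metric and bias criterion
runs = list(product(model_names, metric_names, criteria))
print(len(runs))

# Results are grouped in one CSV file per (criterion, metric) pair
result_files = sorted(
    {"./results/{}_{}.csv".format(criterion, metric)
     for _, metric, criterion in runs})
print(len(result_files))
```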

In [11]:
def run_all(runners, queries_sets, models):

    if not os.path.exists('./results'):
        os.mkdir('./results')

    # The models are loaded one at a time so as not to saturate the RAM
    for model in models:
        try:
            print("Loading {}".format(model['name']))
            # Use the vocabulary prefix of the current model, if it has one
            loaded_model = WordEmbeddingModel(
                api.load(model['name']), model['name'],
                vocab_prefix=model.get('prefix'))
            print("Load complete. Running experiments...")
        except Exception as e:
            print('Error loading: {}. Error: {}. Breaking.'.format(
                model['name'], e))
            break

        for metric in runners:
            for queries_set in queries_sets:
                results = metric[0](queries_set, [loaded_model])
                if os.path.isfile('./results/{}_{}.csv'.format(
                        queries_set[1], metric[1])):
                    saved_results = pd.read_csv(
                        './results/{}_{}.csv'.format(queries_set[1],
                                                     metric[1]), index_col=0)
                    results = pd.concat([results, saved_results], axis=0)
                results.to_csv('./results/{}_{}.csv'.format(
                    queries_set[1], metric[1]))
        print(
            'The experiments were successful.\n----------------------------------\n'
        )
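
The deferred loading used above can also be expressed as a generator. This is a minimal sketch, assuming a hypothetical `fake_loader` in place of `api.load` so that it runs without downloading anything; each spec carries its own optional `prefix`, read with `dict.get`.

```python
def iter_loaded_models(model_specs, loader):
    """Yield (spec, model) pairs, loading each model only when it is requested."""
    for spec in model_specs:
        yield spec, loader(spec["name"])


loaded_log = []


def fake_loader(name):
    """Hypothetical loader that records the call instead of reading vectors."""
    loaded_log.append(name)
    return {"name": name}


specs = [{"name": "model-a"}, {"name": "model-b", "prefix": "/c/en/"}]
models_iter = iter_loaded_models(specs, fake_loader)

# Nothing is loaded until the generator is advanced
assert loaded_log == []

first_spec, first_model = next(models_iter)
first_prefix = first_spec.get("prefix")  # None: this spec has no prefix
print(loaded_log, first_prefix)
```

Because only one model is referenced at a time, the previous one can be released before the next is loaded.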


### Run!

In [28]:
run_all(runners, queries_sets, models)

Loading conceptnet-numberbatch-17-06-300
Load complete. Running experiments...
	Iterations ommited: 1
The experiments were successful.
----------------------------------



## Rankings

The following function builds the rankings from the previous results.
It also creates the overall ranking, which is simply the sum of the other rankings.
Finally, the output contains the rankings and their plots.
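
To make the ranking and summing steps concrete, here is a small sketch in plain Python. It is not the implementation of `create_ranking`; it assumes that a smaller absolute score means less bias and gets rank 1, and that there are no ties. The model names and scores are made up for the example.

```python
def rank_ascending(scores):
    """Map each model to its rank (1 = smallest absolute score); assumes no ties."""
    ordered = sorted(scores, key=lambda model: abs(scores[model]))
    return {model: position for position, model in enumerate(ordered, start=1)}


def sum_rankings(rankings):
    """Add up the rank of each model across several rankings."""
    totals = {}
    for ranking in rankings:
        for model, rank in ranking.items():
            totals[model] = totals.get(model, 0) + rank
    return totals


# Hypothetical average scores per model for two bias criteria
gender_scores = {"model-a": 0.10, "model-b": -0.40, "model-c": 0.25}
religion_scores = {"model-a": 0.30, "model-b": 0.05, "model-c": -0.60}

gender_ranking = rank_ascending(gender_scores)
religion_ranking = rank_ascending(religion_scores)
overall_ranking = sum_rankings([gender_ranking, religion_ranking])
print(overall_ranking)
```

Note that the overall values are sums of ranks rather than ranks themselves, which is why the overall table further down contains values larger than the number of models.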

In [72]:
def make_rankings(runners, queries_sets):
    rankings = []
    ranking_plots = []
    for queries_set in queries_sets:
        queries_set_name = queries_set[1]
        results_by_queries_set = []
        for metric in runners:
            metric_name = metric[1]
            current_result = pd.read_csv(
                './results/{}_{}.csv'.format(queries_set_name, metric_name),
                index_col=0)
            # change the name of the average column to WEAT effect size. 
            if metric_name == 'WEAT_EZ':
                current_result.columns = list(current_result.columns)[0:-1] + [
                    list(current_result.columns)[-1].replace(
                        "WEAT", "WEAT EZ")
                ]
            results_by_queries_set.append(current_result)
        current_ranking = create_ranking(results_by_queries_set)
        current_ranking.columns = [
            column.split(':')[0] for column in current_ranking.columns
        ]

        ranking_plot = plot_ranking(current_ranking, use_metric_as_facet=False)
        ranking_plot.update_layout(width=1200)
        rankings.append(current_ranking)
        ranking_plots.append(ranking_plot)


    general_ranking = reduce(lambda x, y: x.add(y, fill_value=0),  rankings)
    general_ranking_plot = plot_ranking(general_ranking, use_metric_as_facet=False)
    general_ranking_plot.update_layout(width=1200)

    rankings.append(general_ranking)
    ranking_plots.append(general_ranking_plot)

    return rankings, ranking_plots


rankings, ranking_plots = make_rankings(runners, queries_sets)

### Gender Rankings

In [73]:
rankings[0]

| model_name | WEAT | WEAT EZ | RND | RNSB |
|---|---|---|---|---|
| conceptnet-numberbatch-17-06-300 | 1 | 1 | 1 | 1 |
| fasttext-wiki-news-subwords-300 | 3 | 7 | 2 | 2 |
| word2vec-google-news-300 | 6 | 8 | 3 | 3 |
| glove-wiki-gigaword-300 | 5 | 5 | 6 | 8 |
| glove-wiki-gigaword-200 | 8 | 6 | 8 | 7 |
| glove-wiki-gigaword-100 | 7 | 4 | 7 | 6 |
| glove-twitter-100 | 4 | 3 | 5 | 4 |
| glove-twitter-200 | 2 | 2 | 4 | 5 |


In [79]:
ranking_plots[0].show()

### Ethnicity Rankings

In [74]:
rankings[1]

| model_name | WEAT | WEAT EZ | RND | RNSB |
|---|---|---|---|---|
| conceptnet-numberbatch-17-06-300 | 1 | 2 | 1 | 1 |
| fasttext-wiki-news-subwords-300 | 2 | 1 | 2 | 2 |
| word2vec-google-news-300 | 3 | 3 | 3 | 3 |
| glove-wiki-gigaword-300 | 6 | 7 | 8 | 7 |
| glove-wiki-gigaword-200 | 7 | 6 | 7 | 6 |
| glove-wiki-gigaword-100 | 8 | 8 | 6 | 8 |
| glove-twitter-100 | 5 | 5 | 5 | 4 |
| glove-twitter-200 | 4 | 4 | 4 | 5 |


In [81]:
ranking_plots[1].show()

### Religion Rankings

In [75]:
rankings[2]

| model_name | WEAT | WEAT EZ | RND | RNSB |
|---|---|---|---|---|
| conceptnet-numberbatch-17-06-300 | 1 | 4 | 1 | 1 |
| fasttext-wiki-news-subwords-300 | 3 | 3 | 2 | 2 |
| word2vec-google-news-300 | 4 | 5 | 3 | 3 |
| glove-wiki-gigaword-300 | 6 | 7 | 5 | 7 |
| glove-wiki-gigaword-200 | 8 | 8 | 6 | 8 |
| glove-wiki-gigaword-100 | 7 | 6 | 4 | 6 |
| glove-twitter-100 | 5 | 1 | 8 | 4 |
| glove-twitter-200 | 2 | 2 | 7 | 5 |


In [82]:
ranking_plots[2].show()

### Overall Rankings 

In [76]:
rankings[3]

| model_name | WEAT | WEAT EZ | RND | RNSB |
|---|---|---|---|---|
| conceptnet-numberbatch-17-06-300 | 3 | 7 | 3 | 3 |
| fasttext-wiki-news-subwords-300 | 8 | 11 | 6 | 6 |
| word2vec-google-news-300 | 13 | 16 | 9 | 9 |
| glove-wiki-gigaword-300 | 17 | 19 | 19 | 22 |
| glove-wiki-gigaword-200 | 23 | 20 | 21 | 21 |
| glove-wiki-gigaword-100 | 22 | 18 | 17 | 20 |
| glove-twitter-100 | 14 | 9 | 18 | 12 |
| glove-twitter-200 | 8 | 8 | 15 | 15 |


In [77]:
ranking_plots[3]

In [87]:
for i, queries_set in enumerate(queries_sets):
    queries_set_name = queries_set[1]
    rankings[i].to_csv('./results/{}_ranking.csv'.format(queries_set_name))
    ranking_plots[i].write_image('./results/{}_ranking.png'.format(queries_set_name), width = 1200, height= 600, scale=3)

rankings[3].to_csv('./results/overall_ranking.csv')
ranking_plots[3].write_image('./results/overall_ranking.png', width = 1200, height= 600, scale=3)

## Correlations between rankings

This is the last step. Here, we calculate the correlations between the rankings produced by each metric.
These results show how closely the per-metric rankings agree with each other.

The bluer the correlation matrix, the more confident we can be that the rankings are indicating the same thing: that there are some embeddings that are less biased than others.
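
As a worked example of what such a correlation measures, the sketch below computes a Spearman rank correlation by hand for three columns of the gender ranking shown earlier. Treating the correlation as Spearman is an assumption made for illustration; `calculate_ranking_correlations` may be configured differently. The closed-form formula used here is only valid when there are no tied ranks, which holds for these columns.

```python
def spearman_no_ties(rank_a, rank_b):
    """Spearman correlation of two rankings without ties:
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rank_a)
    squared_diffs = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * squared_diffs / (n * (n ** 2 - 1))


# Gender rankings from the table above, in the same model order
gender_weat = [1, 3, 6, 5, 8, 7, 4, 2]
gender_rnd = [1, 2, 3, 6, 8, 7, 5, 4]
gender_rnsb = [1, 2, 3, 8, 7, 6, 4, 5]

rho_weat_rnd = spearman_no_ties(gender_weat, gender_rnd)    # about 0.81
rho_weat_rnsb = spearman_no_ties(gender_weat, gender_rnsb)  # about 0.64
rho_rnd_rnsb = spearman_no_ties(gender_rnd, gender_rnsb)    # about 0.90
print(round(rho_weat_rnd, 2), round(rho_weat_rnsb, 2), round(rho_rnd_rnsb, 2))
```

A value close to 1 means the two metrics order the embeddings almost identically; a value near 0 means the orderings are unrelated.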

In [99]:
gender_correlations = calculate_ranking_correlations(rankings[0])
ethnicity_correlations = calculate_ranking_correlations(rankings[1])
religion_correlations = calculate_ranking_correlations(rankings[2])
overall_correlations = calculate_ranking_correlations(rankings[3])

In [106]:
fig = make_subplots(rows=2,
                    cols=2,
                    subplot_titles=("Gender Ranking Correlation",
                                    "Ethnicity Ranking Correlation",
                                    "Religion Ranking Correlation",
                                    "Overall Ranking Correlation"))
fig.add_trace(plot_ranking_correlations(gender_correlations).data[0], row=1, col=1)
fig.add_trace(plot_ranking_correlations(ethnicity_correlations).data[0], row=1, col=2)
fig.add_trace(plot_ranking_correlations(religion_correlations).data[0], row=2, col=1)
fig.add_trace(plot_ranking_correlations(overall_correlations).data[0], row=2, col=2)

fig.update_layout(width=1200, height = 800)
fig.show()