<a href="https://colab.research.google.com/github/aysegulguzel/aysegulguzel/blob/main/Examining_Representation_Bias_in_Wikipedia_Biographies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


Disaggregators is a library developed by Hugging Face. As the name implies, it "dis-aggregates" a dataset into subgroups so that we can explore it in more granular detail and evaluate data bias.

In [1]:
pip install datasets disaggregators==0.1.2

Collecting en-core-web-lg==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.5.0/en_core_web_lg-3.5.0-py3-none-any.whl (587.7 MB)




There are multiple disaggregation modules available: age, gender, religion, continent, and pronoun.

Let's check the pronoun granularity.

In [2]:
from disaggregators import Disaggregator

disaggregator = Disaggregator("pronoun", column="target_text")


Let's use the Wikipedia biographies dataset, wiki_bio, from Hugging Face Datasets.

The disaggregators library tries to categorize the wiki bios into she_her, he_him, and they_them groups.

In [3]:
from datasets import load_dataset

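# Load only the test split of wiki_bio
wiki_data = load_dataset("wiki_bio", split="test")
# Add one boolean column per pronoun group, then convert to a pandas DataFrame
ds = wiki_data.map(disaggregator)
pdf = ds.to_pandas()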



Map:   0%|          | 0/72831 [00:00<?, ? examples/s]

In [4]:
# Let's take a look at the dataframe
pdf

| | input_text | target_text | pronoun.she_her | pronoun.he_him | pronoun.they_them |
|---|---|---|---|---|---|
| 0 | {'table': {'column_header': ['finalyear', 'bat... | leonard shenoff randle -lrb- born february 12 ... | False | True | False |
| 1 | {'table': {'column_header': ['caption', 'const... | philippe adnot -lrb- born 25 august 1945 in rh... | False | True | False |
| 2 | {'table': {'column_header': ['birth_place', 'n... | miroslav popov -lrb- born 14 june 1995 in dvůr... | False | True | False |
| 3 | {'table': {'column_header': ['death_date', 'na... | john `` jack '' reynolds -lrb- 21 february 186... | False | True | False |
| 4 | {'table': {'column_header': ['associated_acts'... | william ato ankrah , -lrb- born 7th july 1979 ... | False | True | False |
| ... | ... | ... | ... | ... | ... |
| 72826 | {'table': {'column_header': ['finalyear', 'bat... | vernon scot thompson -lrb- born december 7 , 1... | False | True | False |
| 72827 | {'table': {'column_header': ['serviceyears', '... | shabtai shavit -lrb- ; born 17 july 1939 -rrb-... | False | True | False |
| 72828 | {'table': {'column_header': ['birth_place', 'n... | cesar andrade is a brazilian professional vert... | False | True | False |
| 72829 | {'table': {'column_header': ['birth_place', 'b... | moulay hafid elalamy -lrb- born 1960 -rrb- is ... | False | True | False |


The library does not do a good job with they/them, so we will ignore that column.
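To get a feel for why a they/them flag is hard to get right, here is a minimal keyword-matching sketch. It is not the disaggregators implementation, which this notebook does not show; the function `tag_pronouns` and its word lists are hypothetical. A plain word match cannot tell a singular "they" from a plural "they" that refers to a band or a team, so false positives are likely with this kind of approach.

```python
import re

# Hypothetical word lists for illustration; the real library may use different rules.
PRONOUN_GROUPS = {
    "she_her": {"she", "her", "hers", "herself"},
    "he_him": {"he", "him", "his", "himself"},
    "they_them": {"they", "them", "their", "theirs", "themselves"},
}


def tag_pronouns(text):
    """Return one boolean flag per pronoun group found in the text."""
    # Split into lowercase words so that "the" does not match "they".
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {
        f"pronoun.{group}": bool(words & vocab)
        for group, vocab in PRONOUN_GROUPS.items()
    }


# "they" refers to the band here, yet the they_them flag is still set.
bio = "he joined the band in 1990 . they released two albums ."
print(tag_pronouns(bio))
```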

In [5]:
# The pronoun columns are boolean, so summing them counts the True rows
she_count = int(pdf["pronoun.she_her"].sum())
print(f"she_her: {she_count} rows")
he_count = int(pdf["pronoun.he_him"].sum())
print(f"he_him: {he_count} rows")

she_her: 9545 rows
he_him: 44004 rows
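The two counts above are computed independently, so a biography that mentions both "she" and "he" would be counted in both. A small sketch for checking the overlap follows. It assumes only that the two columns are iterables of booleans; the helper name `flag_combinations` and the toy data are hypothetical.

```python
from collections import Counter


def flag_combinations(she_flags, he_flags):
    """Count how often each (she_her, he_him) combination occurs."""
    return Counter((bool(s), bool(h)) for s, h in zip(she_flags, he_flags))


# Toy data for illustration only. In the notebook you would pass
# pdf["pronoun.she_her"] and pdf["pronoun.he_him"] instead.
she = [True, False, True, False, False]
he = [False, True, True, False, True]

counts = flag_combinations(she, he)
print("both:", counts[(True, True)])
print("she_her only:", counts[(True, False)])
print("he_him only:", counts[(False, True)])
print("neither:", counts[(False, False)])
```

If the "both" count on the real data is large, the simple sum of the two totals overstates the number of tagged biographies.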


OK, this is something! The wiki_bio dataset (bios from Wikipedia) test set on the Hugging Face Hub has 9,545 bios tagged she/her and 44,004 tagged he/him.

When we talk about bias, this is what we are talking about. Large language models are not magic. They simply try to predict the next word in a sequence, picking from their whole vocabulary, based on the data you gave them. If you give them 9,545 bios of women against 44,004 bios of men, the model's predictions will be skewed in the same direction.

He/him bios make up 44004 / (9545 + 44004) = 82% of the bios tagged with either pronoun group. Models trained mostly on data about men would be expected to exhibit bias toward men.
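The percentage above can be reproduced in a few lines. This sketch uses the two counts printed earlier and the 72,831 rows reported by the Map progress output; it assumes no row carries both flags, which the notebook does not verify.

```python
# Counts printed by the counting cell, plus the test split size from the Map output.
she_her = 9545
he_him = 44004
total_rows = 72831

# Assumption: no biography is flagged as both she_her and he_him.
tagged = she_her + he_him

print(f"he_him share of tagged rows: {he_him / tagged:.1%}")
print(f"she_her share of tagged rows: {she_her / tagged:.1%}")
print(f"he_him to she_her ratio: {he_him / she_her:.1f} to 1")
print(f"he_him share of all rows: {he_him / total_rows:.1%}")
```

Note that the 82% figure is relative to the tagged rows only. Measured against the whole test split, the he/him share is lower, because many rows carry neither flag.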


Let's check what a model predicts, then! We will use BERT, since we know it was pre-trained on Wikipedia text, among other sources.

In [6]:
from transformers import pipeline

unmasker = pipeline(
    "fill-mask",
    model="bert-base-uncased"
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



To probe what BERT outputs, we intentionally insert a [MASK] token into a sentence and ask BERT to predict the words most likely to replace it.

In [7]:
result = unmasker("This woman works as a [MASK].")
print([r["token_str"] for r in result])

result = unmasker("This man works as a [MASK].")
print([r["token_str"] for r in result])

['nurse', 'maid', 'teacher', 'waitress', 'prostitute']
['carpenter', 'lawyer', 'farmer', 'businessman', 'doctor']
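A single sentence pair is a small sample. A sketch of how the probe could be repeated over several templates follows. The helpers `probe_templates` and `shared_tokens`, the template list, and the stand-in `fake_unmasker` are hypothetical. The only assumption about the real pipeline is that it can be called as `unmasker(sentence, top_k=k)` and returns a list of dicts with a `token_str` key, as in the cell above.

```python
def probe_templates(unmasker, templates, subjects, top_k=5):
    """Fill each template for each subject and collect the predicted tokens."""
    results = {}
    for template in templates:
        for subject in subjects:
            sentence = template.format(subject=subject)
            predictions = unmasker(sentence, top_k=top_k)
            results[(template, subject)] = [p["token_str"] for p in predictions]
    return results


def shared_tokens(results, template, subject_a, subject_b):
    """Return the tokens predicted for both subjects under one template."""
    return set(results[(template, subject_a)]) & set(results[(template, subject_b)])


# Stand-in for the real pipeline so the sketch runs without downloading a model.
# It replays the two outputs printed above and returns nothing for other inputs.
def fake_unmasker(sentence, top_k=5):
    canned = {
        "This woman works as a [MASK].": ["nurse", "maid", "teacher", "waitress", "prostitute"],
        "This man works as a [MASK].": ["carpenter", "lawyer", "farmer", "businessman", "doctor"],
    }
    return [{"token_str": token} for token in canned.get(sentence, [])[:top_k]]


templates = ["This {subject} works as a [MASK]."]
results = probe_templates(fake_unmasker, templates, ["woman", "man"])

# Occupations that appear in both top-5 lists.
print(shared_tokens(results, templates[0], "woman", "man"))
```

In the notebook, pass the real `unmasker` in place of `fake_unmasker` and add more templates to the list. With the two outputs shown above, the top-5 lists share no occupation at all.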


Yep, that is it! This is a very simple illustration of how bias works, in less than 2 minutes. We can prevent this bias and create a much more balanced dataset all together, but there is no easy fix for it.

Jessica Wade can be an inspiration for us!


> I’ve Made More Than 1,700 Wikipedia Entries on Women Scientists and I’m Not Yet Done!
[British scientist Jessica Wade has made one Wikipedia entry every day since 2017](https://www.vice.com/en/article/z34k9e/wikipedia-pages-women-scientists-jessica-wade-stem)


And the biases we see in AI are just a reflection of the biases that exist in human life, all around the globe. They are not that easy to fix with coding, dataset preparation, and so on.

For that, we need to work all together as a community, my friend.


---


