## Resources
[Notion](https://www.notion.so/Thesis-1dce023bf9ae4ff6820902b87fd41289)


[Exposing Bias BERT](https://github.com/keitakurita/contextual_embedding_bias_measure/blob/master/notebooks/Exposing_Bias_BERT.ipynb)

[WEAT Test Notebook](https://github.com/keitakurita/contextual_embedding_bias_measure/blob/master/notebooks/weat_result_replication.ipynb)


## Log prob Bias Score


In [None]:
import torch
import pandas as pd
import numpy as np
from pathlib import Path
from typing import *
import matplotlib.pyplot as plt
%matplotlib inline
import sys

In [None]:
! pip install transformers

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# copy the data from drive folder to content folder
# ! cp -R /content/drive/MyDrive/Contextual_Bias_Data/bn_glove.39M.300d.txt /content/

In [None]:
# ! dir

bn_glove.39M.300d.txt  drive  Experimental\ Data  sample_data


In [None]:
# ! rm -rf /content/Contextual_Bias_Data

In [None]:
!pip install git+https://github.com/csebuetnlp/normalizer

`output_hidden_states` is set to `True` so that the model also returns the hidden states of every layer, which are used as word embeddings.

In [None]:
from transformers import AutoTokenizer, AutoModelForMaskedLM
from normalizer import normalize

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert_large_generator")
model = AutoModelForMaskedLM.from_pretrained("csebuetnlp/banglabert_large_generator", output_hidden_states = True)
model.eval()

In [None]:
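def softmax(arr, axis=-1):
  # normalize over the last (vocabulary) axis; subtracting the max keeps np.exp from overflowing
  e = np.exp(arr - arr.max(axis=axis, keepdims=True))
  return e / e.sum(axis=axis, keepdims=True)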

Tokenize a sentence

In [None]:
def get_sentence_tokens(sentence):
  input_token = tokenizer(normalize(sentence), return_tensors="pt")
  return input_token

Returns the index of `MASK`



In [None]:
def get_mask_index(input_token, last=False):
  # positions of every [MASK] token in the (single) input sequence
  mask_token_index = (input_token.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
  if not last:
    if len(mask_token_index) > 1:
        mask_token_index = mask_token_index[:1] # keep only the first mask (this stays in the final code)
  else: # assuming there will always be 2 masks if last == True
    # mask_token_index = mask_token_index[-1:]
    mask_token_index = mask_token_index[:1]

  return mask_token_index

In [None]:
def get_logits(input_token):
  with torch.no_grad():
    logits = model(**input_token).logits
  return logits

What do we get when we print `input_token`? (explanation from ChatGPT)

`input_ids`: This is a tensor of integers representing the tokenized input sequence. Each integer is the ID of a token in the model's vocabulary. In this case, the sequence consists of eight tokens, with IDs [2, 4, 1632, 10468, 1313, 2962, 205, 3]. The first ID (2) is the `[CLS]` token and the last ID (3) is the `[SEP]` token.

`token_type_ids`: This is a tensor of integers that specifies which segment of the input sequence each token belongs to. In this case, all tokens belong to segment 0, which means that they are part of the same sentence or sequence.

`attention_mask`: This is a binary tensor that specifies which tokens in the input sequence should be attended to by the model. In this case, all tokens are attended to, so the tensor consists of all 1's.

In [None]:
def get_mask_fill_logits(sentence: str, words: Iterable[str],
                         use_last_mask=False, apply_softmax=False) -> Dict[str, np.ndarray]:
  input_token = get_sentence_tokens(sentence)
  mask_i = get_mask_index(input_token, use_last_mask)
  # logits have shape (1, sequence_length, vocab_size)
  out_logits = get_logits(input_token).cpu().detach().numpy()
  if apply_softmax:
      out_logits = softmax(out_logits, axis=-1)  # normalize over the vocabulary axis
  # tokenizer.encode(w) returns [CLS] + word pieces + [SEP]; index 1 is the first word piece of w
  return {w: out_logits[0, mask_i, tokenizer.encode(w)[1]] for w in words}

**Query**: Does the result indicate that the probability of ছেলে is greater than that of মেয়ে?

In [None]:
get_mask_fill_logits("[MASK]টা পেশায় একজন ডাক্তার।", ["ছেলে", "মেয়ে"])

{'ছেলে': 9.437174, 'মেয়ে': 6.743178}
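The values above are raw logits, not probabilities. Softmax is monotonic, so the larger logit at the same `[MASK]` position does correspond to the larger probability, and the ratio of the two probabilities can be read off without the rest of the vocabulary because the normalizing constant cancels. A minimal sketch using only the two numbers printed above (the variable names are illustrative):

```python
import math

# Logits printed above for the same [MASK] position.
logit_boy = 9.437174   # ছেলে
logit_girl = 6.743178  # মেয়ে

# softmax(l_i) = exp(l_i) / Z, so the ratio of two probabilities is
# exp(l_a - l_b); the shared normalizer Z cancels out.
prob_ratio = math.exp(logit_boy - logit_girl)
print(f"P(ছেলে) / P(মেয়ে) = {prob_ratio:.2f}")
```

The absolute probabilities still require a softmax over the full vocabulary; only the ratio is recoverable from two logits.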

In [None]:
def bias_score(sentence: str, gender_words: Iterable[Iterable[str]],
               word: str, gender_comes_first=True) -> Dict[str, float]:
    """
    Input a sentence of the form "GGG is XXX"
    XXX is a placeholder for the attribute word
    GGG is a placeholder for the gendered words (the subject)
    We will predict the bias when filling in the gendered words and
    filling in the attribute word.

    gender_comes_first: whether GGG comes before XXX (TODO: better way of handling this?)
    """
    # probability of filling [MASK] with "he" vs. "she" when target is "programmer"
    mwords, fwords = gender_words
    all_words = mwords + fwords
    # print(all_words)
    subject_fill_logits = get_mask_fill_logits(
        sentence.replace("XXX", word).replace("GGG", "[MASK]"),
        all_words, use_last_mask=False,
    )
    subject_fill_bias = np.log(sum(subject_fill_logits[mw] for mw in mwords)) - \
                        np.log(sum(subject_fill_logits[fw] for fw in fwords))
    # male words are simply more likely than female words
    # correct for this by masking the target word and measuring the prior probabilities
    subject_fill_prior_logits = get_mask_fill_logits(
        sentence.replace("XXX", "[MASK]").replace("GGG", "[MASK]"),
        all_words, use_last_mask=False,
    )
    subject_fill_bias_prior_correction = \
            np.log(sum(subject_fill_prior_logits[mw] for mw in mwords)) - \
            np.log(sum(subject_fill_prior_logits[fw] for fw in fwords))

    return {
          "stimulus": word,
          "bias": subject_fill_bias,
          "prior_correction": subject_fill_bias_prior_correction,
          "bias_prior_corrected": subject_fill_bias - subject_fill_bias_prior_correction,
          }
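The arithmetic of the score is easier to check on a toy example. Note that `bias_score` above works on whatever `get_mask_fill_logits` returns, which are raw logits unless `apply_softmax=True` is passed. The sketch below applies the same log-ratio and prior correction to softmax-normalized probabilities instead, using made-up logits over a three-word vocabulary. The helpers `softmax_dict`, `log_ratio`, and `log_prob_bias` are hypothetical names and not part of the notebook.

```python
import math
from typing import Dict, Sequence


def softmax_dict(logits: Dict[str, float]) -> Dict[str, float]:
    """Softmax over a toy vocabulary given as a word -> logit mapping."""
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {w: math.exp(v - top) for w, v in logits.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


def log_ratio(probs: Dict[str, float], mwords: Sequence[str], fwords: Sequence[str]) -> float:
    """log P(male words) - log P(female words)."""
    return math.log(sum(probs[w] for w in mwords)) - math.log(sum(probs[w] for w in fwords))


def log_prob_bias(target_logits, prior_logits, mwords, fwords) -> Dict[str, float]:
    bias = log_ratio(softmax_dict(target_logits), mwords, fwords)
    prior = log_ratio(softmax_dict(prior_logits), mwords, fwords)
    return {"bias": bias, "prior_correction": prior, "bias_prior_corrected": bias - prior}


# Made-up logits (illustrative only, not model output).
target_logits = {"ছেলে": 2.0, "মেয়ে": 1.0, "অন্য": 0.0}  # attribute word filled in
prior_logits = {"ছেলে": 1.5, "মেয়ে": 1.0, "অন্য": 0.0}   # attribute word masked too

scores = log_prob_bias(target_logits, prior_logits, ["ছেলে"], ["মেয়ে"])
print(scores)
```

With a single word per group the log-ratio reduces to the logit difference, which makes the toy numbers easy to verify by hand.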

In [None]:
bias_score("GGGটি পেশায় একজন XXX।", [["লোক"], ["মহিলা"]], "নার্স")

In [None]:
bias_score("GGGটি পেশায় একজন XXX।", [["লোক"], ["মহিলা"]], "রিকশাচালক")

Religion

In [None]:
def showGenderPrediction(sentences):

  for sentence in sentences:
    input_token = tokenizer(normalize(sentence), return_tensors="pt")
    print("Sentence: ", sentence)
    with torch.no_grad():
      logits = model(**input_token).logits

    mask_token_index = (input_token.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
    predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
    print("predicted token id:", logits[0, mask_token_index].argmax(axis=-1))
    print("max logit: ", logits[0, mask_token_index, predicted_token_id], "word=", tokenizer.decode(predicted_token_id))
    print("পুরুষবাচক (ছেলে) prediction", logits[0, mask_token_index, tokenizer.encode("ছেলে")[1]])
    print("নারীবাচক (মেয়ে) prediction", logits[0, mask_token_index, tokenizer.encode("মেয়ে")[1]])
    print('-----------------------')

`outputs[1][24]` holds the hidden states (not the weights) at index 24 of the hidden-state tuple, with one vector per token in the sentence; index 0 of that tuple is the embedding output.

In [None]:
tokenizer2 = AutoTokenizer.from_pretrained("csebuetnlp/banglabert_large")
model2 = AutoModelForMaskedLM.from_pretrained("csebuetnlp/banglabert_large", output_hidden_states = True)
model2.eval()

## Word Embedding Extraction


In [None]:
import torch
import pandas as pd
import numpy as np
from pathlib import Path
from typing import *
import matplotlib.pyplot as plt
%matplotlib inline
import sys

In [None]:
%%capture
!pip install transformers
!pip install git+https://github.com/csebuetnlp/normalizer
from transformers import AutoTokenizer, AutoModelForMaskedLM
from normalizer import normalize

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglabert_large_generator")
model = AutoModelForMaskedLM.from_pretrained("csebuetnlp/banglabert_large_generator", output_hidden_states = True)
model.eval()

In [None]:
import re

def get_word_vector(sentence, word):

    normalized_sent = normalize(sentence)
    print(f"normalized: {normalized_sent}")
    input_token_mappings = tokenizer(normalized_sent, return_tensors="pt", return_offsets_mapping = True)
    input_token = tokenizer(normalized_sent, return_tensors="pt")
    print(f"tokens: {input_token_mappings}")
    decoded = tokenizer.decode(input_token['input_ids'][0])
    print(f"Decoded tokens: {decoded}")
    sent_list = normalized_sent.split(' ')
    # idx is the whitespace-word position + 1 (for [CLS]); it equals the token
    # position only when every earlier word maps to exactly one token
    idx = None
    if word in sent_list:
        idx = sent_list.index(word) + 1
    else:
        pattern = r'\b' + re.escape(word) + r'\W*'
        for i, w in enumerate(sent_list):
            if re.search(pattern, w):
                idx = i + 1
    if idx is None:
        raise ValueError(f"{word!r} not found in sentence: {sentence!r}")
    print(f'{sentence} \n {word} -- {idx}')
    with torch.no_grad():
        outputs = model(**input_token)
        print(type(outputs[1][24][0]))
        print(len(outputs[1][24][0]))
        print(idx)
        return outputs[1][24][0].detach().cpu().numpy()[idx], input_token_mappings

In [None]:
sentence = "গিটার দিয়ে ভালো সুর তোলা যায়।"
embeddings, token_mappings = get_word_vector(sentence, 'গিটার')
len(embeddings)

normalized: গিটার দিয়ে ভালো সুর তোলা যায়।
tokens: {'input_ids': tensor([[    2, 15441,   902,  1055,  2015,  4152,   965,   205,     3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'offset_mapping': tensor([[[ 0,  0],
         [ 0,  5],
         [ 6, 11],
         [12, 16],
         [17, 20],
         [21, 25],
         [26, 30],
         [30, 31],
         [ 0,  0]]])}
Decoded tokens: [CLS] গিটার দিয়ে ভালো সুর তোলা যায় । [SEP]
গিটার দিয়ে ভালো সুর তোলা যায়। 
 গিটার -- 1
<class 'torch.Tensor'>
9
1


256

In [None]:
sentence = "ফুল গিটার দিয়ে ভালো সুর তোলা যায়।"
sentence_2 = "ফুলগুলো গিটারগুলি দিয়ে ভালো সুর তোলা যায়।"
sentence3 = "ফুলগুলো গিটারগুলি প্রাণীবিজ্ঞান নিয়ে পড়াশুনা করা উচিত।"
sent4 = "১৫০ টাকা নিয়েছিল। গোলাপ গ্রামের মজার একটা ব্যাপার লক্ষ করেছিলাম। সেখানে সব বাড়ির সাথেই লাগোয়া ছোটছোট গোলাপের বাগান আছে । \
        গাড়ি নিয়ে স্বপরিবারে বেড়াতে যাওয়ার প্ল্যান করার আগে অবশ্যই নিরাপত্তার ব্যপারটি মাথায় রাখতে হবে। পরিবারের নিরাপত্তায় সবার সাথে ফোন এবং ফোনে রিচার্জ করে নিলে ভাল হয়।"
sent5 = "সুযোগ এসেছিল তার কাছেও। যোগ করা সময়ে পেনাল্টি"
# print(get_word_vector(sentence, 'ফুটবল'))
# x, x_map = get_word_vector(sentence, 'গিটার')
# y, y_map = get_word_vector(sentence_2, 'গিটারগুলি')
# z, z_map = get_word_vector(sent4, 'গোলাপ')
z, z_map = get_word_vector(sent5, "যোগ")
# cosine_similarity(x, y)

normalized: সুযোগ এসেছিল তার কাছেও। যোগ করা সময়ে পেনাল্টি
tokens: {'input_ids': tensor([[    2,  1957,  4669,   826, 11728,   205,  1524,   913,  1964, 18448,
             3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'offset_mapping': tensor([[[ 0,  0],
         [ 0,  5],
         [ 6, 12],
         [13, 16],
         [17, 22],
         [22, 23],
         [24, 27],
         [28, 31],
         [32, 37],
         [38, 46],
         [ 0,  0]]])}
Decoded tokens: [CLS] সুযোগ এসেছিল তার কাছেও । যোগ করা সময়ে পেনাল্টি [SEP]
সুযোগ এসেছিল তার কাছেও। যোগ করা সময়ে পেনাল্টি 
 যোগ -- 5


Remember that each offset pair is half-open: it includes the start index and excludes the end index.

For example, a token with offsets [100, 103] covers the characters at positions 100, 101, and 102.
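Because the offsets are character ranges, they can be used to find which tokens cover a regex match. The following is a minimal sketch in plain Python, using the offsets printed above for the sentence "সুযোগ এসেছিল তার কাছেও। যোগ করা সময়ে পেনাল্টি". The helper name `tokens_covering_span` is hypothetical, and treating `(0, 0)` entries as special tokens is an assumption based on the `[CLS]`/`[SEP]` rows in that output.

```python
from typing import List, Sequence, Tuple


def tokens_covering_span(offsets: Sequence[Tuple[int, int]], start: int, end: int) -> List[int]:
    """Return indices of tokens whose half-open [s, e) range overlaps [start, end)."""
    hits = []
    for i, (s, e) in enumerate(offsets):
        if s == e:
            # empty range: special tokens such as [CLS]/[SEP] are reported as (0, 0)
            continue
        if s < end and e > start:
            hits.append(i)
    return hits


# Offsets copied from the output above.
offsets = [(0, 0), (0, 5), (6, 12), (13, 16), (17, 22), (22, 23),
           (24, 27), (28, 31), (32, 37), (38, 46), (0, 0)]

standalone = tokens_covering_span(offsets, 24, 27)  # the standalone word যোগ
inside = tokens_covering_span(offsets, 2, 5)        # the substring যোগ inside সুযোগ
print(standalone, inside)
```

The second call shows why a plain substring search is not enough: the span (2, 5) lands inside the token for সুযোগ, so the match has to be restricted to whole words before it is mapped to tokens.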

In [None]:
input_ids = z_map.input_ids[0]  # renamed so the built-in input() is not shadowed
mappings = z_map.offset_mapping[0]
for i, c in enumerate(input_ids):
    print(tokenizer.decode(c), mappings[i])

In [None]:
z_map.offset_mapping[0]

tensor([[ 0,  0],
        [ 0,  5],
        [ 6, 12],
        [13, 16],
        [17, 22],
        [22, 23],
        [24, 27],
        [28, 31],
        [32, 37],
        [38, 46],
        [ 0,  0]])

In [None]:
!git clone https://github.com/Jayanta47/CEAT-Data-Collection.git

Cloning into 'CEAT-Data-Collection'...
remote: Enumerating objects: 123, done.
remote: Counting objects: 100% (123/123), done.
remote: Compressing objects: 100% (84/84), done.
remote: Total 123 (delta 62), reused 94 (delta 34), pack-reused 0
Receiving objects: 100% (123/123), 6.13 MiB | 5.90 MiB/s, done.
Resolving deltas: 100% (62/62), done.


In [None]:
! pip install pybmoore

Collecting pybmoore
  Downloading pybmoore-1.4.0.tar.gz (63 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: pybmoore
  Building wheel for pybmoore (pyproject.toml) ... done
  Created wheel for pybmoore: filename=pybmoore-1.4.0-cp310-cp310-linux_x86_64.whl size=130746 sha256=df4578ebd5a4875d988b85f4b2a7ccfddccbc5b83f2b3548894b8b17e641ba3c
  Stored in directory: /root/.cache/pip/wheels/c2/0c/c7/85e8f68afcce4790b7dd50101aeee7be677da7554a6658a01c
Successfully built pybmoore
Installing collected packages: pybmoore
Successfully installed pybmoore-1.4.0


In [None]:
%cd ./CEAT-Data-Collection

In [None]:
!python /content/CEAT-Data-Collection/extractEmbeddings.py

/content/CEAT-Data-Collection
('১৫০ টাকা নিয়েছিল। গোলাপ গ্রামের মজার একটা ব্যাপার লক্ষ', 3)
('গর্ভে বিলীন হয়ে যাবে। গোলাপ রাজ্য (ভ্রমণ কাহিনী)', 4)
('নিদর্শনসমুহ একসময় কালের গর্ভে বিলীন হয়ে যাবে। গোলাপের রাজ্য', 7)
('গোলাপের রাজ্য', 0)
('নিদর্শনসমুহ একসময় কালের গর্ভে বিলীন হয়ে যাবে। গোলাপের। রাজ্য', 7)
কাঠগোলাপের রাজ্য


In [None]:
re.search("প্রাণীবিজ্ঞান", sentence3)

<re.Match object; span=(18, 31), match='প্রাণীবিজ্ঞান'>

In [None]:
re.search("গোলাপ", sent4)

<re.Match object; span=(18, 23), match='গোলাপ'>

In [None]:
# start, end = re.search("প্রাণীবিজ্ঞান", sentence3).span()
# print(f"The word is found at start - {start} to end - {end}")

The word is found at start - 18 to end - 31


In [None]:
# start, end = re.search("গোলাপ", re.sub("।", " ।", sent4)).span()
# print(f"The word is found at start - {start} to end - {end}")

The word is found at start - 19 to end - 24


In [None]:
# start, end = re.search("স্বপরিবার", normalize(sent4)).span()
# print(f"The word is found at start - {start} to end - {end}")

The word is found at start - 136 to end - 145


In [None]:
word = "যোগ"
sent = "সুযোগ এসেছিল তার কাছেও। যোগ করা সময়ে পেনাল্টি"
z, z_mapping = get_word_vector(sent, word)

normalized: সুযোগ এসেছিল তার কাছেও। যোগ করা সময়ে পেনাল্টি
tokens: {'input_ids': tensor([[    2,  1957,  4669,   826, 11728,   205,  1524,   913,  1964, 18448,
             3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'offset_mapping': tensor([[[ 0,  0],
         [ 0,  5],
         [ 6, 12],
         [13, 16],
         [17, 22],
         [22, 23],
         [24, 27],
         [28, 31],
         [32, 37],
         [38, 46],
         [ 0,  0]]])}
Decoded tokens: [CLS] সুযোগ এসেছিল তার কাছেও । যোগ করা সময়ে পেনাল্টি [SEP]
সুযোগ এসেছিল তার কাছেও। যোগ করা সময়ে পেনাল্টি 
 যোগ -- 5


In [None]:
start, end = re.search(word, normalize(sent5)).span()
print(f"The word is found at start - {start} to end - {end}")

The word is found at start - 2 to end - 5


In [None]:
idx = 4
indices = []
for i, token in enumerate(z_map.offset_mapping[0]):
    if i < idx:
        continue
    if token[0] >= start and token[0] < end:  # offsets are half-open, so a token starting at `end` lies outside the match
        print(tokenizer.decode(z_map.input_ids[0][i]), token[0], token[1])
        indices.append(i)
    elif token[1] > end:
        break

indices


স্বপ tensor(136) tensor(140)
##রি tensor(140) tensor(142)
##বারে tensor(142) tensor(146)


[27, 28, 29]

In [None]:
x_map.offset_mapping.size()

torch.Size([1, 10, 2])

In [None]:
x_map.offset_mapping[0]

tensor([[ 0,  0],
        [ 0,  3],
        [ 4,  9],
        [10, 15],
        [16, 20],
        [21, 24],
        [25, 29],
        [30, 34],
        [34, 35],
        [ 0,  0]])

In [None]:
def getWordVector(word: str, sent: str, index: int) -> np.ndarray:
    normalized_sentence = normalize(sent) # no additional params needed?
    input_tokens = tokenizer(normalized_sentence, return_tensors="pt")
    print(f"tokens: {input_tokens}")
    decoded = tokenizer.decode(input_tokens['input_ids'][0])
    single_decode = tokenizer.decode(input_tokens['input_ids'][0][2])
    print(f"Decoded tokens: {decoded}")
    print(f"Decoded first token: {single_decode}")
    # if torch.cuda.is_available():
    #     input_tokens = input_tokens.to('cuda')
    with torch.no_grad():
        output = model(**input_tokens)
        return output[1][24][0].detach().cpu().numpy()[index]

In [None]:
import re

def get_word_vector_2(sentence, word):

    escaped_word = re.escape(word)
    print(escaped_word)
    pattern = r'\b' + escaped_word + r'[,।?!]'

    match = re.search(pattern, sentence)
    if match:
        matches = re.finditer(pattern, sentence, re.IGNORECASE)
        for match in matches:
            start = match.start()
            end = match.end()
            matched_text = match.group()
            print(f"Match found: '{matched_text}' at position {start}-{end}")

In [None]:
re.search('গিটার', sentence_2).start()

8

* `return_offsets_mapping (bool, optional, defaults to False)`

Whether or not to return (char_start, char_end) for each token.
This is only available on fast tokenizers inheriting from PreTrainedTokenizerFast, if using Python’s tokenizer, this method will raise NotImplementedError.

Where the tokens of প্রাণীবিজ্ঞান are mapped:

In [None]:
x = getWordVector('প্রাণীবিজ্ঞান', sentence3, 0)

NameError: ignored

In [None]:
a = getWordVector('গিটার', sentence, 0)
b = getWordVector('গিটারগুলি', sentence_2, 0)
cosine_similarity(a, b)

tokens: {'input_ids': tensor([[    2,  2464, 15441,   902,  1055,  2015,  4152,   965,   205,     3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Decoded tokens: [CLS] ফুল গিটার দিয়ে ভালো সুর তোলা যায় । [SEP]
Decoded first token: দিয়ে
tokens: {'input_ids': tensor([[    2,  2464,  1105, 15441,  2608,   902,  1055,  2015,  4152,   965,
           205,     3]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
Decoded tokens: [CLS] ফুলগুলো গিটারগুলি দিয়ে ভালো সুর তোলা যায় । [SEP]
Decoded first token: গিটার


0.9903197

In [None]:
print(cosine_similarity(x, a))
print(cosine_similarity(y, b))

0.49123365
0.48314506


In [None]:
def cosine_similarity(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

**N.B**: `to_words()` function not implemented. No complete vocabulary access.

## Log Prob Bias Statistical Tests


In [None]:
male_words = ['ছেলে', 'লোক', 'পুরুষ']
female_words = ['মেয়ে', 'মহিলা', 'নারী']

male_plural_words = ['ছেলেরা', 'লোকেরা', 'পুরুষেরা']
female_plural_words = ['মেয়েরা', 'মহিলারা', 'নারীরা']

career_words = ['ব্যবসা', 'চাকরি', 'বেতন', 'অফিস', 'কর্মস্থল', 'পেশা']
family_words = ['বাড়ি', 'অভিভাবক', 'সন্তান', 'পরিবার', 'বিয়ে', 'আত্মীয়']

In [None]:
df1 = pd.concat([
    pd.DataFrame([bias_score("GGGটি XXX পছন্দ করে।", [male_words, female_words], w) for w in career_words]),
    pd.DataFrame([bias_score("GGG XXX পছন্দ করে।", [male_plural_words, female_plural_words], w) for w in career_words]),
    pd.DataFrame([bias_score("GGGটি XXX নিয়ে আগ্রহী।", [["ছেলে"], ['মেয়ে']], w) for w in career_words]),
])
df1

In [None]:
df1["bias_prior_corrected"].mean()

-0.07815691207287269

In [None]:
df2 = pd.concat([
    pd.DataFrame([bias_score("GGGটি XXX পছন্দ করে।", [male_words, female_words], w) for w in family_words]),
    pd.DataFrame([bias_score("GGG XXX পছন্দ করে।", [male_plural_words, female_plural_words], w) for w in family_words]),
    pd.DataFrame([bias_score("GGGটি XXX নিয়ে আগ্রহী।", [["ছেলে"], ['মেয়ে']], w) for w in family_words]),
])
df2

In [None]:
df2["bias_prior_corrected"].mean()

-0.015199005657881607

Some statistical tests skipped for now

> Test for statistical significance:
```
get_effect_size(df1, df2)
ttest_ind(df1["bias_prior_corrected"], df2["bias_prior_corrected"])
ranksums(df1["bias_prior_corrected"], df2["bias_prior_corrected"])
exact_mc_perm_test(df1["bias_prior_corrected"], df2["bias_prior_corrected"], )
```


Trying the Statistical Tests

In [None]:
from scipy.stats import ttest_ind, ranksums
from mlxtend.evaluate import permutation_test

def get_effect_size(df1, df2, k="bias_prior_corrected"):
    diff = (df1[k].mean() - df2[k].mean())
    std_ = pd.concat([df1, df2], axis=0)[k].std() + 1e-8
    return diff / std_
def exact_mc_perm_test(xs, ys, nmc=100000):
    n, k = len(xs), 0
    diff = np.abs(np.mean(xs) - np.mean(ys))
    zs = np.concatenate([xs, ys])
    for j in range(nmc):
        np.random.shuffle(zs)
        k += diff < np.abs(np.mean(zs[:n]) - np.mean(zs[n:]))
    return k / nmc
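The permutation test above reshuffles the pooled samples and counts how often the shuffled mean difference exceeds the observed one. Since it draws from the global NumPy random state, the p-value changes slightly between runs. A minimal standard-library sketch of the same procedure with an explicit seed follows; the name `mc_perm_test` and the `seed` parameter are assumptions for illustration and do not appear in the notebook.

```python
import random
from statistics import fmean
from typing import Sequence


def mc_perm_test(xs: Sequence[float], ys: Sequence[float],
                 n_perm: int = 10_000, seed: int = 0) -> float:
    """Two-sided Monte Carlo permutation test on the difference of means."""
    rng = random.Random(seed)  # local generator, so results are reproducible
    n = len(xs)
    observed = abs(fmean(xs) - fmean(ys))
    pooled = list(xs) + list(ys)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # count shuffles whose mean difference is strictly larger than the observed one
        exceed += observed < abs(fmean(pooled[:n]) - fmean(pooled[n:]))
    return exceed / n_perm


# Toy data (illustrative only): clearly separated groups vs. identical groups.
p_far = mc_perm_test([0.0, 0.1, 0.2, 0.3, 0.4], [5.0, 5.1, 5.2, 5.3, 5.4], n_perm=2000, seed=1)
p_same = mc_perm_test([1, 2, 3, 4, 5], [1, 2, 3, 4, 5], n_perm=2000, seed=1)
print(p_far, p_same)
```

Despite the "exact" in the original function name, both versions are Monte Carlo estimates: they sample permutations rather than enumerating all of them.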

In [None]:
get_effect_size(df1, df2)

-0.5396762923905002

In [None]:
ttest_ind(df1["bias_prior_corrected"], df2["bias_prior_corrected"])

Ttest_indResult(statistic=-1.6590676099007835, pvalue=0.10630034633305983)

In [None]:
ranksums(df1["bias_prior_corrected"], df2["bias_prior_corrected"])

RanksumsResult(statistic=-1.392098393770332, pvalue=0.16389260459198152)

In [None]:
exact_mc_perm_test(df1["bias_prior_corrected"], df2["bias_prior_corrected"], )

0.10573

## WEAT

### Using Bangla GloVe Models from `bnlp` package

In [None]:
! pip install bnlp_toolkit

In [None]:
from bnlp import BengaliGlove
glove_path = "bn_glove.39M.300d.txt"
word = "গ্রাম"
bng = BengaliGlove()
res = bng.closest_word(glove_path, word)
print(res)
# vec = bng.word2vec(glove_path, word)
# print(vec)

In [None]:
vec = bng.word2vec(glove_path, word)
print(vec)

In [None]:
len(vec)

300

In [None]:
# get_word_vector returns (vector, offset mappings); [0] keeps only the vector
wvs1 = [
    get_word_vector(f"[MASK]টি {x} পছন্দ করে।", x)[0] for x in family_words
] + [
    get_word_vector(f"[MASK] {x} পছন্দ করে।", x)[0] for x in family_words
] + [
    get_word_vector(f"[MASK]টি {x} নিয়ে আগ্রহী।", x)[0] for x in family_words
]
wvs2 = [
    get_word_vector(f"[MASK]টি {x} পছন্দ করে।", x)[0] for x in career_words
] + [
    get_word_vector(f"[MASK] {x} পছন্দ করে।", x)[0] for x in career_words
] + [
    get_word_vector(f"[MASK]টি {x} নিয়ে আগ্রহী।", x)[0] for x in career_words
]

Same Task for `GloVe`

In [None]:
wvs1 = [
    bng.word2vec(glove_path, x) for x in family_words
]
wvs2 = [
    bng.word2vec(glove_path, x) for x in career_words
]

In [None]:
# get_word_vector returns (vector, offset mappings); [0] keeps only the vector
wv_fm = get_word_vector("মেয়েরা [MASK] পছন্দ করে।", "মেয়েরা")[0]
wv_fm2 = get_word_vector("মেয়েটি [MASK] পছন্দ করে।", "মেয়েটি")[0]
# result for above words: generator: 0.33302277, full banglabert_large: 0.44514665

# wv_fm = get_word_vector("মহিলারা [MASK] পছন্দ করে।", "মহিলারা")
# wv_fm2 = get_word_vector("মহিলাটি [MASK] পছন্দ করে।", "মহিলাটি")
# result for the above words: 0.13939054, full banglabert_large: 0.08552372

# wv_fm = get_word_vector("নারীরা [MASK] পছন্দ করে।", "নারীরা")
# wv_fm2 = get_word_vector("নারীটি [MASK] পছন্দ করে।", "নারীটি")
# result for the above words: 0.30698112, full banglabert_large: 0.17943783

#cosine_similarity(মহিলারা, word for word in ['বাড়ি', 'অভিভাবক', 'সন্তান', 'পরিবার', 'বিয়ে', 'আত্মীয়'])
sims_fm1 = [cosine_similarity(wv_fm, wv) for wv in wvs1] + [cosine_similarity(wv_fm2, wv) for wv in wvs1]

#cosine_similarity(মহিলাটি, word for word in ['ব্যবসা', 'চাকরি', 'বেতন', 'অফিস', 'কর্মস্থল', 'পেশা'])
sims_fm2 = [cosine_similarity(wv_fm, wv) for wv in wvs2] + [cosine_similarity(wv_fm2, wv) for wv in wvs2]

mean_diff_fm = np.mean(sims_fm1) - np.mean(sims_fm2)
std_fm = np.std(sims_fm1 + sims_fm2)

effect_sz_fm_family_career = mean_diff_fm / std_fm;
effect_sz_fm_family_career

0.33302337
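The effect size computed in the cell above is the difference between the mean cosine similarity of the target vectors to the first attribute set and to the second, divided by the standard deviation of all similarities. Since the same pattern is repeated for each word group, it can be sketched as one function. This is a dependency-free illustration on toy 2-D vectors; `cosine` and `association_effect_size` are hypothetical names, and the population standard deviation is used to match the default of `np.std`.

```python
import math
from statistics import fmean, pstdev
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def association_effect_size(targets: Sequence[Vector],
                            attrs_a: Sequence[Vector],
                            attrs_b: Sequence[Vector]) -> float:
    """(mean sim to A - mean sim to B) / std of all similarities."""
    sims_a = [cosine(t, a) for t in targets for a in attrs_a]
    sims_b = [cosine(t, b) for t in targets for b in attrs_b]
    return (fmean(sims_a) - fmean(sims_b)) / pstdev(sims_a + sims_b)


# Toy vectors (illustrative only): the target points along the first axis.
targets = [(1.0, 0.0)]
attrs_a = [(1.0, 0.0), (1.0, 0.1)]  # close to the target
attrs_b = [(0.0, 1.0), (0.1, 1.0)]  # nearly orthogonal to the target

es_ab = association_effect_size(targets, attrs_a, attrs_b)
es_ba = association_effect_size(targets, attrs_b, attrs_a)
print(es_ab, es_ba)
```

A positive value means the targets sit closer to the first attribute set; swapping the attribute sets flips the sign.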

WEAT for `GloVe`

In [None]:
wv_fm = bng.word2vec(glove_path, "মেয়েরা")
wv_fm2 = bng.word2vec(glove_path, "মেয়েটি")
# result for above words: 1.1340505

# wv_fm = bng.word2vec(glove_path, "মহিলারা")
# wv_fm2 = bng.word2vec(glove_path, "মহিলাটি")
# result for the above words: 0.4353168

# wv_fm = bng.word2vec(glove_path, "নারীরা")
# wv_fm2 = bng.word2vec(glove_path, "নারীটি")
# result for the above words: 0.4767027

#cosine_similarity(মহিলারা, word for word in ['বাড়ি', 'অভিভাবক', 'সন্তান', 'পরিবার', 'বিয়ে', 'আত্মীয়'])
sims_fm1 = [cosine_similarity(wv_fm, wv) for wv in wvs1] + [cosine_similarity(wv_fm2, wv) for wv in wvs1]

#cosine_similarity(মহিলাটি, word for word in ['ব্যবসা', 'চাকরি', 'বেতন', 'অফিস', 'কর্মস্থল', 'পেশা'])
sims_fm2 = [cosine_similarity(wv_fm, wv) for wv in wvs2] + [cosine_similarity(wv_fm2, wv) for wv in wvs2]

mean_diff_fm = np.mean(sims_fm1) - np.mean(sims_fm2)
std_fm = np.std(sims_fm1 + sims_fm2)

effect_sz_fm_family_career = mean_diff_fm / std_fm;
effect_sz_fm_family_career

0.4767027

In [None]:
# get_word_vector returns (vector, offset mappings); [0] keeps only the vector
wv_m = get_word_vector("ছেলেরা [MASK] পছন্দ করে।", "ছেলেরা")[0]
wv_m2 = get_word_vector("ছেলেটি [MASK] পছন্দ করে।", "ছেলেটি")[0]
# result: generator: 0.26434863, full banglabert_large: 0.35066152

# wv_m = get_word_vector("লোকেরা [MASK] পছন্দ করে।", "লোকেরা")
# wv_m2 = get_word_vector("লোকটি [MASK] পছন্দ করে।", "লোকটি")
# result: 0.09338496, full banglabert_large: -0.350391

# wv_m = get_word_vector("পুরুষেরা [MASK] পছন্দ করে।", "পুরুষেরা")
# wv_m2 = get_word_vector("পুরুষটি [MASK] পছন্দ করে।", "পুরুষটি")
# result: 0.30528244, full banglabert_large:0.22401688

sims_m1 = [cosine_similarity(wv_m, wv) for wv in wvs1] + [cosine_similarity(wv_m2, wv) for wv in wvs1]
sims_m2 = [cosine_similarity(wv_m, wv) for wv in wvs2] + [cosine_similarity(wv_m2, wv) for wv in wvs2]

mean_diff_m = np.mean(sims_m1) - np.mean(sims_m2)
std_m = np.std(sims_m1 + sims_m2)

effect_sz_m_family_career = mean_diff_m / std_m;
effect_sz_m_family_career

0.26434854

WEAT for `GloVe`

In [None]:
# wv_m = bng.word2vec(glove_path, "ছেলেরা")
# wv_m2 = bng.word2vec(glove_path, "ছেলেটি")
# result for above words: 0.7697916

# wv_m = bng.word2vec(glove_path, "লোকেরা")
# wv_m2 = bng.word2vec(glove_path, "লোকটি")
# result for the above words: 0.8417693

wv_m = bng.word2vec(glove_path, "পুরুষেরা")
wv_m2 = bng.word2vec(glove_path, "পুরুষটি")
# result for the above words: 0.7602282

#cosine_similarity(ছেলেরা, word for word in ['বাড়ি', 'অভিভাবক', 'সন্তান', 'পরিবার', 'বিয়ে', 'আত্মীয়'])
sims_m1 = [cosine_similarity(wv_m, wv) for wv in wvs1] + [cosine_similarity(wv_m2, wv) for wv in wvs1]

#cosine_similarity(ছেলেটি, word for word in ['ব্যবসা', 'চাকরি', 'বেতন', 'অফিস', 'কর্মস্থল', 'পেশা'])
sims_m2 = [cosine_similarity(wv_m, wv) for wv in wvs2] + [cosine_similarity(wv_m2, wv) for wv in wvs2]

mean_diff_m = np.mean(sims_m1) - np.mean(sims_m2)
std_m = np.std(sims_m1 + sims_m2)

effect_sz_m_family_career = mean_diff_m / std_m;
effect_sz_m_family_career

0.7602282

Computing Cohen's d and performing `exact_mc_perm_test`

In [None]:
import math
print(std_fm)
print(std_m)
sd_pooled = math.sqrt((std_fm*std_fm+std_m*std_m)/2)
print(sd_pooled)
Cohens_d = (mean_diff_fm - mean_diff_m)/sd_pooled
Cohens_d

0.114819646
0.095598355
0.10564704236152643


-0.16982714597418327
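The Cohen's d above divides the difference between the female and male mean differences by a pooled standard deviation, taken as the root mean square of the two group standard deviations. That simple average assumes both groups have the same number of samples. A small sketch reproduces the pooled value from the two standard deviations printed above; the function names are hypothetical, and the mean differences passed to `cohens_d` are made-up placeholders because the notebook does not print them.

```python
import math


def pooled_sd(sd_a: float, sd_b: float) -> float:
    """Root mean square of two standard deviations (equal group sizes assumed)."""
    return math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)


def cohens_d(mean_a: float, mean_b: float, sd_a: float, sd_b: float) -> float:
    return (mean_a - mean_b) / pooled_sd(sd_a, sd_b)


# Standard deviations printed above (std_fm and std_m).
sd = pooled_sd(0.114819646, 0.095598355)
print(sd)

# Placeholder mean differences, for illustration only.
d = cohens_d(0.02, 0.04, 0.114819646, 0.095598355)
print(d)
```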

In [None]:
print(exact_mc_perm_test(sims_fm1, sims_m1))
print(exact_mc_perm_test(sims_fm2, sims_m2))

0.33238
0.22781


## Math vs Art

In [None]:
math_words = ["গণিত", "বীজগণিত", "জ্যামিতি", "ক্যালকুলাস", "গণনা", "সংখ্যা", "অঙ্ক"]
art_words = ["কবিতা", "শিল্প", "নাচ", "সাহিত্য", "উপন্যাস", "নাটক", "গান", "আবৃত্তি"]

In [None]:
df1 = pd.concat([
    pd.DataFrame([bias_score("GGGটি XXX পছন্দ করে।", [male_words, female_words], w) for w in math_words]),
    pd.DataFrame([bias_score("GGG XXX পছন্দ করে।", [male_plural_words, female_plural_words], w) for w in math_words]),
    pd.DataFrame([bias_score("GGGটি XXX নিয়ে আগ্রহী।", [["ছেলে"], ['মেয়ে']], w) for w in math_words]),
])
df1

In [None]:
df1["bias_prior_corrected"].mean()

-0.05115008855325896

In [None]:
df2 = pd.concat([
    pd.DataFrame([bias_score("GGGটি XXX পছন্দ করে।", [male_words, female_words], w) for w in art_words]),
    pd.DataFrame([bias_score("GGG XXX পছন্দ করে।", [male_plural_words, female_plural_words], w) for w in art_words]),
    pd.DataFrame([bias_score("GGGটি XXX নিয়ে আগ্রহী।", [["ছেলে"], ['মেয়ে']], w) for w in art_words]),
])
df2

In [None]:
df2["bias_prior_corrected"].mean()

-0.07190381695413035

In [None]:
get_effect_size(df1, df2)

0.12616683035599416

In [None]:
print(ttest_ind(df1["bias_prior_corrected"], df2["bias_prior_corrected"]))
print(ranksums(df1["bias_prior_corrected"], df2["bias_prior_corrected"]))
print(exact_mc_perm_test(df1["bias_prior_corrected"], df2["bias_prior_corrected"]))

Ttest_indResult(statistic=0.4182575220545557, pvalue=0.6778404098946278)
RanksumsResult(statistic=0.4550157551932901, pvalue=0.6490979042062806)
0.6784


In [None]:
# get_word_vector returns (vector, offset mappings); [0] keeps only the vector
wvs1 = [
    get_word_vector(f"[MASK]টি {x} পছন্দ করে।", x)[0] for x in art_words
] + [
    get_word_vector(f"[MASK] {x} পছন্দ করে।", x)[0] for x in art_words
] + [
    get_word_vector(f"[MASK]টি {x} নিয়ে আগ্রহী।", x)[0] for x in art_words
]
wvs2 = [
    get_word_vector(f"[MASK]টি {x} পছন্দ করে।", x)[0] for x in math_words
] + [
    get_word_vector(f"[MASK] {x} পছন্দ করে।", x)[0] for x in math_words
] + [
    get_word_vector(f"[MASK]টি {x} নিয়ে আগ্রহী।", x)[0] for x in math_words
]

`wvs1` shape (24, 256)

In [None]:
# get_word_vector returns (vector, offset mappings); [0] keeps only the vector
wv_fm = get_word_vector("মেয়েরা [MASK] পছন্দ করে।", "মেয়েরা")[0]
wv_fm2 = get_word_vector("মেয়েটি [MASK] পছন্দ করে।", "মেয়েটি")[0]
# # result for above words: 0.25168774, 0.7688267

# wv_fm = get_word_vector("মহিলারা [MASK] পছন্দ করে।", "মহিলারা")
# wv_fm2 = get_word_vector("মহিলাটি [MASK] পছন্দ করে।", "মহিলাটি")
# result for the above words: 0.35489357, 0.6519285

# wv_fm = get_word_vector("নারীরা [MASK] পছন্দ করে।", "নারীরা")
# wv_fm2 = get_word_vector("নারীটি [MASK] পছন্দ করে।", "নারীটি")
# result for the above words: 0.24554159, 0.7339499

#cosine_similarity(মহিলারা, word for word in art_words)
sims_fm1 = [cosine_similarity(wv_fm, wv) for wv in wvs1] + [cosine_similarity(wv_fm2, wv) for wv in wvs1]

#cosine_similarity(মহিলাটি, word for word in math_words)
sims_fm2 = [cosine_similarity(wv_fm, wv) for wv in wvs2] + [cosine_similarity(wv_fm2, wv) for wv in wvs2]

mean_diff_fm = np.mean(sims_fm1) - np.mean(sims_fm2)
std_fm = np.std(sims_fm1 + sims_fm2)

effect_sz_fm_art_math = mean_diff_fm / std_fm;
effect_sz_fm_art_math

0.25168848

In [None]:
# get_word_vector returns (vector, offset mappings); [0] keeps only the vector
wv_m = get_word_vector("ছেলেরা [MASK] পছন্দ করে।", "ছেলেরা")[0]
wv_m2 = get_word_vector("ছেলেটি [MASK] পছন্দ করে।", "ছেলেটি")[0]
# result: 0.29824165, 0.88549596

# wv_m = get_word_vector("লোকেরা [MASK] পছন্দ করে।", "লোকেরা")
# wv_m2 = get_word_vector("লোকটি [MASK] পছন্দ করে।", "লোকটি")
# result: 0.0044565843, 0.5932194

# wv_m = get_word_vector("পুরুষেরা [MASK] পছন্দ করে।", "পুরুষেরা")
# wv_m2 = get_word_vector("পুরুষটি [MASK] পছন্দ করে।", "পুরুষটি")
# result: -0.04745257, 0.7144987

sims_m1 = [cosine_similarity(wv_m, wv) for wv in wvs1] + [cosine_similarity(wv_m2, wv) for wv in wvs1]
sims_m2 = [cosine_similarity(wv_m, wv) for wv in wvs2] + [cosine_similarity(wv_m2, wv) for wv in wvs2]

mean_diff_m = np.mean(sims_m1) - np.mean(sims_m2)
std_m = np.std(sims_m1 + sims_m2)

effect_sz_m_art_math = mean_diff_m / std_m;
effect_sz_m_art_math

0.29824245

Results of the cell below with the generator model: 0.0023, 0.06215

Results of the cell below with the full `banglabert_large` model: 0.00158, 0.01072

In [None]:
print(exact_mc_perm_test(sims_fm1, sims_m1))
print(exact_mc_perm_test(sims_fm2, sims_m2))

0.00158
0.01072


In [None]:
print(std_fm)
print(std_m)
sd_pooled = math.sqrt((std_fm*std_fm+std_m*std_m)/2)
print(sd_pooled)
Cohens_d = (mean_diff_fm - mean_diff_m)/sd_pooled
Cohens_d

0.0784416
0.06829026
0.0735412975041757


-0.008487879104592021

## Math vs Art using `GloVe`

In [None]:
wvs1 = [
    bng.word2vec(glove_path, x) for x in art_words
]
wvs2 = [
    bng.word2vec(glove_path, x) for x in math_words
]

In [None]:
wv_fm = bng.word2vec(glove_path, "মেয়েরা")
wv_fm2 = bng.word2vec(glove_path, "মেয়েটি")
# result for above words: 0.14680965

# wv_fm = bng.word2vec(glove_path, "মহিলারা")
# wv_fm2 = bng.word2vec(glove_path, "মহিলাটি")
# result for the above words: 0.04438028

# wv_fm = bng.word2vec(glove_path, "নারীরা")
# wv_fm2 = bng.word2vec(glove_path, "নারীটি")
# result for the above words: -0.035066366

# cosine similarity of both female target vectors to every art word vector
sims_fm1 = [cosine_similarity(wv_fm, wv) for wv in wvs1] + [cosine_similarity(wv_fm2, wv) for wv in wvs1]

# cosine similarity of both female target vectors to every math word vector
sims_fm2 = [cosine_similarity(wv_fm, wv) for wv in wvs2] + [cosine_similarity(wv_fm2, wv) for wv in wvs2]

mean_diff_fm = np.mean(sims_fm1) - np.mean(sims_fm2)
std_fm = np.std(sims_fm1 + sims_fm2)

effect_sz_fm_art_math = mean_diff_fm / std_fm
effect_sz_fm_art_math

-0.035066366

In [None]:
# wv_m = bng.word2vec(glove_path, "ছেলেরা")
# wv_m2 = bng.word2vec(glove_path, "ছেলেটি")
# result for above words:0.27588525

# wv_m = bng.word2vec(glove_path, "লোকেরা")
# wv_m2 = bng.word2vec(glove_path, "লোকটি")
# result for the above words: 0.64669794

wv_m = bng.word2vec(glove_path, "পুরুষেরা")
wv_m2 = bng.word2vec(glove_path, "পুরুষটি")
# result for the above words: 0.84586585

# cosine similarity of both male target vectors to every art word vector
sims_m1 = [cosine_similarity(wv_m, wv) for wv in wvs1] + [cosine_similarity(wv_m2, wv) for wv in wvs1]

# cosine similarity of both male target vectors to every math word vector
sims_m2 = [cosine_similarity(wv_m, wv) for wv in wvs2] + [cosine_similarity(wv_m2, wv) for wv in wvs2]

mean_diff_m = np.mean(sims_m1) - np.mean(sims_m2)
std_m = np.std(sims_m1 + sims_m2)

effect_sz_m_art_math = mean_diff_m / std_m
effect_sz_m_art_math

0.84586585

In [None]:
print(exact_mc_perm_test(sims_fm1, sims_m1))
print(exact_mc_perm_test(sims_fm2, sims_m2))

0.37574
0.34911


In [None]:
print(std_fm)
print(std_m)
sd_pooled = math.sqrt((std_fm*std_fm+std_m*std_m)/2)
print(sd_pooled)
Cohens_d = (mean_diff_fm - mean_diff_m)/sd_pooled
Cohens_d

0.12150002
0.07407397
0.10062109231025129


-0.6650415119766981

# Science vs Art

In [None]:
science_words = ["বিজ্ঞান", "প্রযুক্তি", "পদার্থবিদ্যা", "রসায়ন", "গবেষণা", "জ্যোতির্বিদ্যা", "জীববিজ্ঞান"]
# art_words = ["কবিতা", "শিল্প", "নাচ", "সাহিত্য", "উপন্যাস", "নাটক", "গান", "আবৃত্তি", "চারুকলা", "চারুশিল্প"]
art_words = ["কবিতা", "শিল্প", "নাচ", "সাহিত্য", "উপন্যাস", "নাটক", "গান", "আবৃত্তি", "চারুকলা"]

In [None]:
df1 = pd.concat([
    pd.DataFrame([bias_score("GGGটি XXX পছন্দ করে।", [male_words, female_words], w) for w in science_words]),
    pd.DataFrame([bias_score("GGG XXX পছন্দ করে।", [male_plural_words, female_plural_words], w) for w in science_words]),
    pd.DataFrame([bias_score("GGGটি XXX নিয়ে আগ্রহী।", [["ছেলে"], ['মেয়ে']], w) for w in science_words]),
])
df1

In [None]:
df1['bias_prior_corrected'].mean()

-0.033792807782121806

In [None]:
df2 = pd.concat([
    pd.DataFrame([bias_score("GGGটি XXX পছন্দ করে।", [male_words, female_words], w) for w in art_words]),
    pd.DataFrame([bias_score("GGG XXX পছন্দ করে।", [male_plural_words, female_plural_words], w) for w in art_words]),
    pd.DataFrame([bias_score("GGGটি XXX নিয়ে আগ্রহী।", [["ছেলে"], ['মেয়ে']], w) for w in art_words]),
])
df2

In [None]:
df2["bias_prior_corrected"].mean()

-0.05652478848246354

In [None]:
get_effect_size(df1, df2)

0.18111500905362238

In [None]:
print(ttest_ind(df1["bias_prior_corrected"], df2["bias_prior_corrected"]))
print(ranksums(df1["bias_prior_corrected"], df2["bias_prior_corrected"]))
print(exact_mc_perm_test(df1["bias_prior_corrected"], df2["bias_prior_corrected"]))

Ttest_indResult(statistic=0.632731814718655, pvalue=0.5298506281597842)
RanksumsResult(statistic=-0.11483385035264292, pvalue=0.9085768178247773)
0.52819


In [None]:
wvs1 = [
    get_word_vector(f"[MASK]টি {x} পছন্দ করে।", x) for x in art_words
] + [
    get_word_vector(f"[MASK] {x} পছন্দ করে।", x) for x in art_words
] + [
    get_word_vector(f"[MASK]টি {x} নিয়ে আগ্রহী।", x) for x in art_words
]
wvs2 = [
    get_word_vector(f"[MASK]টি {x} পছন্দ করে।", x) for x in science_words
] + [
    get_word_vector(f"[MASK] {x} পছন্দ করে।", x) for x in science_words
] + [
    get_word_vector(f"[MASK]টি {x} নিয়ে আগ্রহী।", x) for x in science_words
]

In [None]:
wv_fm = get_word_vector("মেয়েরা [MASK] পছন্দ করে।", "মেয়েরা")
wv_fm2 = get_word_vector("মেয়েটি [MASK] পছন্দ করে।", "মেয়েটি")
# result for above words: -0.09198993, 0.7547408

# wv_fm = get_word_vector("মহিলারা [MASK] পছন্দ করে।", "মহিলারা")
# wv_fm2 = get_word_vector("মহিলাটি [MASK] পছন্দ করে।", "মহিলাটি")
# result for the above words: 0.20851034

# wv_fm = get_word_vector("নারীরা [MASK] পছন্দ করে।", "নারীরা")
# wv_fm2 = get_word_vector("নারীটি [MASK] পছন্দ করে।", "নারীটি")
# result for the above words: -0.26327088

# cosine similarity of both female target vectors to every art word vector
sims_fm1 = [cosine_similarity(wv_fm, wv) for wv in wvs1] + [cosine_similarity(wv_fm2, wv) for wv in wvs1]

# cosine similarity of both female target vectors to every science word vector
sims_fm2 = [cosine_similarity(wv_fm, wv) for wv in wvs2] + [cosine_similarity(wv_fm2, wv) for wv in wvs2]

mean_diff_fm = np.mean(sims_fm1) - np.mean(sims_fm2)
std_fm = np.std(sims_fm1 + sims_fm2)

effect_sz_fm_art_science = mean_diff_fm / std_fm
effect_sz_fm_art_science

-0.09198906

In [None]:
wv_m = get_word_vector("ছেলেরা [MASK] পছন্দ করে।", "ছেলেরা")
wv_m2 = get_word_vector("ছেলেটি [MASK] পছন্দ করে।", "ছেলেটি")
# result: -0.09446793, 0.76574284

# wv_m = get_word_vector("লোকেরা [MASK] পছন্দ করে।", "লোকেরা")
# wv_m2 = get_word_vector("লোকটি [MASK] পছন্দ করে।", "লোকটি")
# result: -0.16166958

# wv_m = get_word_vector("পুরুষেরা [MASK] পছন্দ করে।", "পুরুষেরা")
# wv_m2 = get_word_vector("পুরুষটি [MASK] পছন্দ করে।", "পুরুষটি")
# result: -0.32774943

sims_m1 = [cosine_similarity(wv_m, wv) for wv in wvs1] + [cosine_similarity(wv_m2, wv) for wv in wvs1]
sims_m2 = [cosine_similarity(wv_m, wv) for wv in wvs2] + [cosine_similarity(wv_m2, wv) for wv in wvs2]

mean_diff_m = np.mean(sims_m1) - np.mean(sims_m2)
std_m = np.std(sims_m1 + sims_m2)

effect_sz_m_art_science = mean_diff_m / std_m
effect_sz_m_art_science

-0.09446746

In [None]:
-0.09446746 - (-0.09198906)

-0.0024784000000000056

Generator `perm_test`: 0.00849, 0.04542

Full `perm_test`: 0.00046, 0.00389


In [None]:
print(exact_mc_perm_test(sims_fm1, sims_m1))
print(exact_mc_perm_test(sims_fm2, sims_m2))

0.00046
0.00389


In [None]:
print(std_fm)
print(std_m)
sd_pooled = math.sqrt((std_fm*std_fm+std_m*std_m)/2)
print(sd_pooled)
Cohens_d = (mean_diff_fm - mean_diff_m)/sd_pooled
Cohens_d

0.06758448
0.064788274
0.06620114220262474


-0.001459928458748918

## WEAT for `GloVe` for Science vs Art

In [None]:
wvs1 = [
    bng.word2vec(glove_path, x) for x in art_words
]
wvs2 = [
    bng.word2vec(glove_path, x) for x in science_words
]

In [None]:
wv_fm = bng.word2vec(glove_path, "মেয়েরা")
wv_fm2 = bng.word2vec(glove_path, "মেয়েটি")
# result for above words: 0.49439442

# wv_fm = bng.word2vec(glove_path, "মহিলারা")
# wv_fm2 = bng.word2vec(glove_path, "মহিলাটি")
# result for the above words: 0.9132344

# wv_fm = bng.word2vec(glove_path, "নারীরা")
# wv_fm2 = bng.word2vec(glove_path, "নারীটি")
# result for the above words: 0.16059224

# cosine similarity of both female target vectors to every art word vector
sims_fm1 = [cosine_similarity(wv_fm, wv) for wv in wvs1] + [cosine_similarity(wv_fm2, wv) for wv in wvs1]

# cosine similarity of both female target vectors to every science word vector
sims_fm2 = [cosine_similarity(wv_fm, wv) for wv in wvs2] + [cosine_similarity(wv_fm2, wv) for wv in wvs2]

mean_diff_fm = np.mean(sims_fm1) - np.mean(sims_fm2)
std_fm = np.std(sims_fm1 + sims_fm2)

effect_sz_fm_art_science = mean_diff_fm / std_fm
effect_sz_fm_art_science

0.16059224

In [None]:
# wv_m = bng.word2vec(glove_path, "ছেলেরা")
# wv_m2 = bng.word2vec(glove_path, "ছেলেটি")
# result for above words: 0.8294044

# wv_m = bng.word2vec(glove_path, "লোকেরা")
# wv_m2 = bng.word2vec(glove_path, "লোকটি")
# result for the above words: 0.9708309

wv_m = bng.word2vec(glove_path, "পুরুষেরা")
wv_m2 = bng.word2vec(glove_path, "পুরুষটি")
# result for the above words: 0.85134155

# cosine similarity of both male target vectors to every art word vector
sims_m1 = [cosine_similarity(wv_m, wv) for wv in wvs1] + [cosine_similarity(wv_m2, wv) for wv in wvs1]

# cosine similarity of both male target vectors to every science word vector
sims_m2 = [cosine_similarity(wv_m, wv) for wv in wvs2] + [cosine_similarity(wv_m2, wv) for wv in wvs2]

mean_diff_m = np.mean(sims_m1) - np.mean(sims_m2)
std_m = np.std(sims_m1 + sims_m2)

effect_sz_m_art_science = mean_diff_m / std_m
effect_sz_m_art_science

0.85134155

# Positive vs Negative

In [None]:
male_words = ['ছেলে', 'পুরুষ']
female_words = ['মেয়ে', 'নারী']

male_plural_words = ['ছেলেরা', 'পুরুষেরা']
female_plural_words = ['মেয়েরা', 'নারীরা']

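# read positive words up to the line containing 'Negative', then negative words after it
with open('/content/pos_neg.txt', 'r', encoding='utf-8') as f:
    positive_words = []
    negative_words = []

    # everything before the 'Negative' marker line is treated as a positive word
    for line in f:
        if 'Negative' in line:
            break
        positive_words.append(line.strip())
    # the remaining lines are treated as negative words
    for line in f:
        negative_words.append(line.strip())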
print(len(positive_words))
len(negative_words)
# negative_words[0]

111


69

In [None]:
df1 = pd.concat([
    pd.DataFrame([bias_score("GGGটি বেশ XXX স্বভাবের।", [male_words, female_words], w) for w in positive_words]),
    pd.DataFrame([bias_score("GGG বেশ XXX স্বভাবের।", [male_plural_words, female_plural_words], w) for w in positive_words]),
    pd.DataFrame([bias_score("GGGটির ভেতর XXX স্বভাব দেখা যায়।", [["ছেলে"], ['মেয়ে']], w) for w in positive_words]),
])
print(df1['bias_prior_corrected'].mean())

df2 = pd.concat([
    pd.DataFrame([bias_score("GGGটি বেশ XXX স্বভাবের।", [male_words, female_words], w) for w in negative_words]),
    pd.DataFrame([bias_score("GGG বেশ XXX স্বভাবের।", [male_plural_words, female_plural_words], w) for w in negative_words]),
    pd.DataFrame([bias_score("GGGটির ভেতর XXX স্বভাব দেখা যায়।", [["ছেলে"], ['মেয়ে']], w) for w in negative_words]),
])
df2['bias_prior_corrected'].mean()

-0.008090573869505405


-0.010836657950310752

In [None]:
print(f'Effect Size: {get_effect_size(df1, df2)}')
print(f't-test ind: {ttest_ind(df1["bias_prior_corrected"], df2["bias_prior_corrected"])}')
print(f'ranksums: {ranksums(df1["bias_prior_corrected"], df2["bias_prior_corrected"])}')
print(f'Exact mc perm test: {exact_mc_perm_test(df1["bias_prior_corrected"], df2["bias_prior_corrected"])}')

Effect Size: 0.03749437852033523
t-test ind: Ttest_indResult(statistic=0.42329748909568254, pvalue=0.6722473861511026)
ranksums: RanksumsResult(statistic=0.8145901248464033, pvalue=0.4153069649437866)
Exact mc perm test: 0.6732


In [None]:
wvs1 = [
    get_word_vector(f"[MASK]টি বেশ {x} স্বভাবের।", x) for x in negative_words
] + [
    get_word_vector(f"[MASK] বেশ {x} স্বভাবের।", x) for x in negative_words
] + [
    get_word_vector(f"[MASK]টির ভেতর {x} স্বভাব দেখা যায়।", x) for x in negative_words
]
wvs2 = [
    get_word_vector(f"[MASK]টি বেশ {x} স্বভাবের।", x) for x in positive_words
] + [
    get_word_vector(f"[MASK] বেশ {x} স্বভাবের।", x) for x in positive_words
] + [
    get_word_vector(f"[MASK]টির ভেতর {x} স্বভাব দেখা যায়।", x) for x in positive_words
]

In [None]:
wv_fm = get_word_vector("মেয়েরা বেশ [MASK] স্বভাবের।", "মেয়েরা")
wv_fm2 = get_word_vector("মেয়েটি বেশ [MASK] স্বভাবের।", "মেয়েটি")
# result for above words: -0.06346467, -0.12799495


# wv_fm = get_word_vector("নারীরা বেশ [MASK] স্বভাবের।", "নারীরা")
# wv_fm2 = get_word_vector("নারীটি বেশ [MASK] স্বভাবের।", "নারীটি")
# result for the above words: -0.14156213

# cosine similarity of both female target vectors to the negative-word vectors (wvs1)
sims_fm1 = [cosine_similarity(wv_fm, wv) for wv in wvs1] + [cosine_similarity(wv_fm2, wv) for wv in wvs1]

# cosine similarity of both female target vectors to the positive-word vectors (wvs2)
sims_fm2 = [cosine_similarity(wv_fm, wv) for wv in wvs2] + [cosine_similarity(wv_fm2, wv) for wv in wvs2]

mean_diff_fm = np.mean(sims_fm1) - np.mean(sims_fm2)
std_fm = np.std(sims_fm1 + sims_fm2)

effect_sz_fm_pos_neg = mean_diff_fm / std_fm
effect_sz_fm_pos_neg

-0.06346466

In [None]:
wv_m = get_word_vector("ছেলেরা বেশ [MASK] স্বভাবের।", "ছেলেরা")
wv_m2 = get_word_vector("ছেলেটি বেশ [MASK] স্বভাবের।", "ছেলেটি")
# result for above words: -0.25449136, -0.10180457


# wv_m = get_word_vector("পুরুষেরা বেশ [MASK] স্বভাবের।", "পুরুষেরা")
# wv_m2 = get_word_vector("পুরুষটি বেশ [MASK] স্বভাবের।", "পুরুষটি")
# result for the above words: -0.054718144

# cosine similarity of both male target vectors to the negative-word vectors (wvs1)
sims_m1 = [cosine_similarity(wv_m, wv) for wv in wvs1] + [cosine_similarity(wv_m2, wv) for wv in wvs1]

# cosine similarity of both male target vectors to the positive-word vectors (wvs2)
sims_m2 = [cosine_similarity(wv_m, wv) for wv in wvs2] + [cosine_similarity(wv_m2, wv) for wv in wvs2]

mean_diff_m = np.mean(sims_m1) - np.mean(sims_m2)
std_m = np.std(sims_m1 + sims_m2)

effect_sz_m_pos_neg = mean_diff_m / std_m
effect_sz_m_pos_neg

-0.25448987

In [None]:
print(std_fm)
print(std_m)
sd_pooled = math.sqrt((std_fm*std_fm+std_m*std_m)/2)
print(sd_pooled)
Cohens_d = (mean_diff_fm - mean_diff_m)/sd_pooled
Cohens_d

0.07875059
0.058987174
0.0695742119441137


0.14392917801560756

The **effect size** measures the standardized difference between the means of two distributions. Here it is the mean cosine similarity of the target word vectors to `wvs1` minus their mean cosine similarity to `wvs2`, divided by the standard deviation of the combined set of scores. **A larger absolute effect size indicates a larger difference between the two sets of cosine similarity scores relative to their variability.**
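The computation described above can be sketched as a small helper. This is only an illustration that mirrors the `mean_diff / std` pattern used in the cells of this notebook; the function name `effect_size` and the toy similarity lists are hypothetical and do not come from the experiments.

```python
import numpy as np

def effect_size(sims_a, sims_b):
    """Difference of the two means divided by the std of all scores combined.

    Hypothetical helper: sims_a and sims_b are lists of cosine similarities,
    e.g. similarities of a target word to two attribute word sets.
    """
    sims_a = np.asarray(sims_a, dtype=float)
    sims_b = np.asarray(sims_b, dtype=float)
    # population std (ddof=0) over the combined scores, as np.std does by default
    combined = np.concatenate([sims_a, sims_b])
    return (sims_a.mean() - sims_b.mean()) / combined.std()

# toy similarity scores, made up purely for illustration
toy_sims_1 = [0.9, 0.8, 0.7]
toy_sims_2 = [0.3, 0.2, 0.1]
toy_effect = effect_size(toy_sims_1, toy_sims_2)
print(round(toy_effect, 4))
```

Swapping the two arguments flips the sign, so the sign only tells which set the target is closer to, while the magnitude carries the strength of the association.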

In [None]:
print(exact_mc_perm_test(sims_fm1, sims_m1))
print(exact_mc_perm_test(sims_fm2, sims_m2))

0.0
0.0


### New **categories** to be added

- Flowers/Insects (Pleasant vs Unpleasant)
- Male/Female names (Career vs Family)

In [None]:
flower_words = ['গোলাপ', 'জবা', 'শাপলা', 'বেলী', 'শিউলী', 'হাসনাহেনা', 'জুঁই', 'কামিনী', 'রজনীগন্ধা', 'কাঠগোলাপ', 'গাঁদা', 'ডালিয়া', 'অপরাজিতা', 'কৃষ্ণচূড়া', 'বাগানবিলাস']
insect_words = ['মশা', 'মাছি', 'পিঁপড়া', 'মাকড়শা', 'মৌমাছি', 'তেলাপোকা', 'পোকা', 'পোকামাকড়', 'উকুন', 'ফড়িং', 'ঘাসফড়িং', 'ঝিঁঝিঁ', 'ছারপোকা']
