# Bias in Text Embeddings

Text embedding models, which represent words or phrases as numerical vectors, can encode bias. These biases reflect societal prejudices and stereotypes, and when the embeddings feed AI systems such as search engines and recommendation systems, they can lead to biased outcomes. Addressing this bias is crucial for building fair AI systems, and it requires ongoing research and deliberate effort during model development.

In [2]:
import torch
from transformers import AutoTokenizer, AutoModel

In [9]:
model_id = "yiyanghkust/finbert-pretrain"

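# Load the base encoder (no task head, so `num_labels` is not needed) and its tokenizer
m = AutoModel.from_pretrained(model_id)
t = AutoTokenizer.from_pretrained(model_id)
# Make sure a padding token is defined so sentences can be batched
t.add_special_tokens({'pad_token': '[PAD]'})
m.eval()  # inference mode: disables dropout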



BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30873, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [10]:
texts = ['growth is strong and we have plenty of liquidity.',
         'there is a shortage of capital, and we need extra financing.'
]
t_input = t(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    # hidden_states[-1] is the output of the final encoder layer
    last_hidden_state = m(**t_input, output_hidden_states=True).hidden_states[-1]

# Position-weighted mean pooling: token i gets weight i (1-based), padding gets 0,
# so later tokens contribute more to the sentence embedding
weights_for_non_padding = t_input.attention_mask * torch.arange(start=1, end=last_hidden_state.shape[1] + 1).unsqueeze(0)

sum_embeddings = torch.sum(last_hidden_state * weights_for_non_padding.unsqueeze(-1), dim=1)
# The normalizer is the sum of the position weights, not the token count
sum_of_weights = torch.sum(weights_for_non_padding, dim=-1).unsqueeze(-1)
sentence_embeddings = sum_embeddings / sum_of_weights

print(t_input.input_ids)
print(weights_for_non_padding)
print(sum_of_weights)
print(sentence_embeddings.shape)


tensor([[   3,   64,   17,  253,    8,   13,   29, 9146,    7,  466,   48,    4,
            0,    0,    0],
        [   3,  112,   17,   11, 8371,    7,   65,  585,    8,   13,  573, 3980,
          411,   48,    4]])
tensor([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12,  0,  0,  0],
        [ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15]])
tensor([[ 78],
        [120]])
torch.Size([2, 768])
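
The weights printed above grow with token position (1, 2, 3, ...) and are zero on padding, so the pooling is a position-weighted average rather than a plain mean. The totals 78 and 120 are the sums 1 + ... + 12 and 1 + ... + 15. A minimal pure-Python sketch of the same arithmetic on toy vectors follows. The helper names `position_weights`, `weighted_mean_pool`, and `mean_pool` are illustrative and not part of any library.

```python
def position_weights(attention_mask):
    """Weight each real token by its 1-based position; padding gets 0."""
    return [pos * keep for pos, keep in enumerate(attention_mask, start=1)]


def weighted_mean_pool(token_vectors, attention_mask):
    """Position-weighted average of token vectors (later tokens count more)."""
    weights = position_weights(attention_mask)
    total = sum(weights)
    dim = len(token_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, token_vectors)) / total
        for d in range(dim)
    ]


def mean_pool(token_vectors, attention_mask):
    """Plain mean over non-padding tokens, shown for comparison."""
    count = sum(attention_mask)
    dim = len(token_vectors[0])
    return [
        sum(keep * vec[d] for keep, vec in zip(attention_mask, token_vectors)) / count
        for d in range(dim)
    ]


# Toy example: three 2-d token vectors, the last one is padding.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]

print(position_weights(toy_mask))
print(weighted_mean_pool(toy_vectors, toy_mask))
print(mean_pool(toy_vectors, toy_mask))

# Weight totals for a 12-token sentence padded to 15, and a 15-token sentence
print(sum(position_weights([1] * 12 + [0] * 3)))
print(sum(position_weights([1] * 15)))
```

In the toy example the padded vector is ignored by both poolers, while the weighted version pulls the result toward the second token. Whether that emphasis on later tokens suits a bidirectional encoder such as BERT is a design choice worth checking against plain mean pooling.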


In [11]:
from sklearn.metrics.pairwise import cosine_similarity



In [14]:
cosine_similarity(sentence_embeddings[0].unsqueeze(0), sentence_embeddings[1].unsqueeze(0))

array([[0.84771043]], dtype=float32)
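
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction, not magnitude. The sketch below spells that out in plain Python. The function name `cosine` is chosen here for illustration and is not the scikit-learn implementation.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Same direction, different length
print(cosine([1.0, 2.0], [2.0, 4.0]))
# Orthogonal vectors
print(cosine([1.0, 0.0], [0.0, 1.0]))
# Opposite directions
print(cosine([1.0, 0.0], [-1.0, 0.0]))
```

Read this way, the score of about 0.85 above says the two pooled vectors point in broadly similar directions, even though the sentences describe opposite financial conditions.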

In [15]:
from angle_emb import AnglE, Prompts
# Note: this import shadows sklearn's cosine_similarity imported earlier
from angle_emb.utils import cosine_similarity


# Note: this loads the finbert-tone checkpoint, not the finbert-pretrain model used above
angle = AnglE.from_pretrained('yiyanghkust/finbert-tone', pooling_strategy='cls').cuda()
# `Prompts.C` is the retrieval-query prompt that AnglE recommends for UAE-Large-V1; here it is applied to the FinBERT query (documents need no prompt).
# When a prompt is specified, each input must be a dict with the key 'text'.
qv = angle.encode({'text': 'growth is strong and we have plenty of liquidity.'}, to_numpy=True, prompt=Prompts.C)
doc_vecs = angle.encode(texts, to_numpy=True)

for dv in doc_vecs:
    print(cosine_similarity(qv[0], dv))


0.9602243900299072
0.2621897757053375


Other libraries we could use:

- <https://sbert.net/docs/sentence_transformer/usage/semantic_textual_similarity.html>