# Evaluating and Mitigating Fairness in Language Models used in Financial Services

In this report, we map current approaches to evaluating and mitigating bias in large language models (LLMs) and discuss their applicability to financial services. We review the strategies proposed in different application areas for identifying and reducing bias in language models, and consider how they can be integrated into financial services. Given the increasing interest in integrating language models into decision-making processes, from credit scoring to fraud detection, understanding and addressing biases is imperative. This review not only highlights the existing methodologies but also explores their effectiveness and limitations within the context of financial applications. Our goal is to offer actionable insights and recommendations that can guide industry practitioners in implementing more equitable and robust AI systems.

**Definition of Bias:**
Bias is a systematic and unfair deviation in data or algorithms that leads to inaccurate or prejudiced outcomes. What counts as bias depends on the use case, context and culture. A biased system can potentially harm both users and developers in various ways.

**Bias in text data – financial services context:**
- In financial services, data commonly differ between geographical areas. Developing a universal model is therefore often tricky, since the data cannot be shared across silos.
- Text data often includes numerical relations, which can carry hidden bias. Understanding the relations between entities and exploring them with a human-in-the-loop approach is therefore critical.


In [None]:
# In this notebook, we use examples from Gallegos et al.'s fair LLM benchmark which is a compilation of publicly-available bias evaluation datasets.
!git clone https://github.com/i-gallegos/Fair-LLM-Benchmark.git ../data/Fair-LLM-Benchmark

In [None]:
import os
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from datasets import load_dataset
import torch
from transformers import AutoTokenizer, AutoModel

In [None]:
import sys
# caution: path[0] is reserved for script path (or '' in REPL)
sys.path.insert(1, './utils/')
from counterfactual_generator import generate_random_counterfactual, generate_counterfactuals

## Evaluation with Text Prompts

### Counterfactuals

Counterfactuals are hypothetical scenarios used to analyse what would happen if a certain event or condition were different from what actually occurred.

In bias evaluation, we can use counterfactuals to assess how changes in certain variables (e.g., gender, race) affect outcomes. They help us identify and measure biases in algorithms by comparing results across different hypothetical scenarios.

Current research uses counterfactuals in two ways: with masked tokens or with unmasked tokens.

#### Masked Tokens

- **Step 1:** Mask a token in a sentence that could reveal bias, such as a gender-specific word (e.g., "he" or "she").
- **Step 2:**  Replace the masked token with different alternatives to create counterfactual scenarios (e.g., replacing "he" with "she").
- **Step 3:**  Analyse the model's predictions for each counterfactual scenario to see if the output changes significantly, indicating potential bias.

Alternatively, we can give LLMs a “fill-mask” task to check whether the model fills the masked token with a stereotypical instance. The aim is to obtain the probability that the model assigns to each candidate for the masked token.
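
As a minimal sketch of the three steps, the snippet below masks a gendered pronoun, builds one counterfactual per alternative, and compares two candidates with a log-probability ratio. The helper names (`mask_token`, `fill_mask`, `log_probability_ratio`) and the probabilities are hypothetical placeholders; in practice the probabilities would come from a fill-mask model.

```python
import math
import re

def mask_token(sentence, target, mask="[MASK]"):
    """Replace the first whole-word occurrence of `target` with a mask token."""
    return re.sub(rf"\b{re.escape(target)}\b", mask, sentence, count=1)

def fill_mask(masked_sentence, candidates, mask="[MASK]"):
    """Create one counterfactual sentence per candidate token."""
    return {c: masked_sentence.replace(mask, c, 1) for c in candidates}

def log_probability_ratio(probs, a, b):
    """Positive values mean the model prefers `a` over `b` for the masked slot."""
    return math.log(probs[a] / probs[b])

sentence = "The developer argued with the designer because she did not like the design."
masked = mask_token(sentence, "she")
counterfactuals = fill_mask(masked, ["he", "she"])

# Hypothetical fill-mask probabilities, for illustration only.
hypothetical_probs = {"he": 0.62, "she": 0.31}
score = log_probability_ratio(hypothetical_probs, "he", "she")

print(masked)
print(counterfactuals["he"])
print(round(score, 3))
```

A score near zero would suggest that the model treats both candidates similarly for this sentence; a single sentence is of course not enough to draw conclusions.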

**Some commonly used datasets:**
-	WinoBias, WinoBias+, Winogender, GAP, GAP-Subjective, BUG

In [None]:
# Let's explore some samples from Winobias dataset
winobias_path = '../data/Fair-LLM-Benchmark/WinoBias/data'
winobias_df = pd.read_table(os.path.join(winobias_path, 'anti_stereotyped_type1.txt.dev'), header=None, names=['sentence'])

print(winobias_df["sentence"][0])

In a general-purpose language modelling task, it is easier to define the masked words. We can use the EHRC’s definition of protected attributes: age, disability, gender reassignment, pregnancy and maternity (which includes breastfeeding), race, religion or belief, sex and sexual orientation.

However, when we develop an NLP application for a financial service, the bias in the input data is usually not explicit. Geographical and cultural biases can be hidden in technical jargon, so masked tokens have limited applicability when we evaluate finance LMs. They can still be useful for applications that assess individual financial reports or generate insights from cultural and political news.

#### Unmasked Tokens

- **Step 1:** Identify tokens that may introduce bias, such as gender, race, or other demographic indicators.
- **Step 2:** Construct new sentences by changing these tokens to their counterfactual equivalents (e.g., changing "John" to "Jane").
- **Step 3:** Compare the model’s responses or scores for the original and counterfactual sentences to evaluate differences in treatment or outcomes.

**Some commonly used datasets:**
-	CrowS-Pairs, Equity Evaluation Corpus, RedditBias, HolisticBias

The aim is to evaluate which sentence is more likely to be produced by the model.
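
One way to turn this comparison into a number is to count how often the model scores the more stereotypical sentence above its counterpart. The sketch below assumes a sentence-scoring function (for example a log-likelihood); `stereotype_preference_rate` and the toy scores are hypothetical and only illustrate the bookkeeping.

```python
def stereotype_preference_rate(pairs, score_fn):
    """Share of pairs where the more stereotypical sentence scores higher.

    A value near 0.5 suggests no systematic preference on these pairs.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    preferred = sum(1 for more, less in pairs if score_fn(more) > score_fn(less))
    return preferred / len(pairs)

# Hypothetical sentence scores; a real run would query the model instead.
toy_scores = {
    "John is good at maths.": -4.1,
    "Jane is good at maths.": -5.3,
    "The nurse said she was tired.": -3.0,
    "The nurse said he was tired.": -3.0,
}
pairs = [
    ("John is good at maths.", "Jane is good at maths."),
    ("The nurse said she was tired.", "The nurse said he was tired."),
]
rate = stereotype_preference_rate(pairs, toy_scores.get)
print(rate)
```

With paired columns such as `sent_more` and `sent_less`, the same function could be applied to the rows of a dataframe once a real scoring function is available.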


In [None]:
# Let's explore some samples from CrowS-Pairs dataset

crows_pairs_path = '../data/Fair-LLM-Benchmark/CrowS-Pairs/data'
crows_pairs_df = pd.read_csv(os.path.join(crows_pairs_path, 'crows_pairs_anonymized.csv'))
crows_pairs_df.head()

In [None]:
print(crows_pairs_df["sent_more"][0])
print(crows_pairs_df["sent_less"][0])

### Sentence Completion

In sentence completion, some datasets pair potentially stereotypical prompts such as *"The nurse said she would..."* with counterfactual variations such as *"The nurse said he would..."*. Alternatively, some datasets aim to match the distribution of human-written text to model potentially untargeted biases.

In [None]:
# One of the sentence completion tasks included in the benchmark is BOLD
# Let's explore some samples from BOLD dataset
bold_path = '../data/Fair-LLM-Benchmark/BOLD/data'
bold_json = pd.read_json(os.path.join(bold_path, 'prompts/profession_prompt.json'))
bold_json.keys()

In [None]:
bold_json["metalworking_occupations"]

### Question Answering

In the question-answering task, the datasets formulate questions that could reveal bias (e.g., "Who is more likely to be a doctor, John or Jane?").
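
A minimal sketch of how such questions could be generated from a template is shown below. Asking each pair of names in both orders is an assumption added here to separate a name preference from a position preference; the helper names and the recorded answers are hypothetical.

```python
from itertools import permutations

def build_questions(template, names):
    """Fill the template with every ordered pair of names."""
    return [template.format(a=a, b=b) for a, b in permutations(names, 2)]

def answer_rate(answers, name):
    """Fraction of recorded answers that equal `name`."""
    if not answers:
        raise ValueError("answers must not be empty")
    return sum(1 for ans in answers if ans == name) / len(answers)

template = "Who is more likely to be a doctor, {a} or {b}?"
questions = build_questions(template, ["John", "Jane"])

# Hypothetical model answers, one per question, for illustration only.
hypothetical_answers = ["John", "John"]

print(questions)
print(answer_rate(hypothetical_answers, "John"))
```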

In [None]:
# One popular dataset used for QA task is BBQ, it also includes templates for QA which you can use to customize your own dataset
# Let's explore some samples from BBQ dataset
bbq_path = '../data/Fair-LLM-Benchmark/BBQ/data'
bbq_json = pd.read_json(os.path.join(bbq_path, 'Age.jsonl'), lines=True)
bbq_json.head()

### Shared steps among these datasets

- Define a task for the model to identify potentially different answers and record them.
- Compare the answers for the original and counterfactual questions to assess any biased responses.
- Check for consistency and fairness in the model's answers across different demographic groups.
- Develop metrics to quantify bias, such as measuring the probability of specific completions or the frequency of certain answers.
- Compare these metrics across different demographics to identify patterns of bias.
- Use the analysis to understand the model's behaviour and identify areas where it exhibits bias.
- Provide insights into the types and sources of bias present in the model.
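
The metric-related steps can be sketched with a simple frequency comparison. The example below computes the share of "positive" predictions per group and the largest gap between groups; the records, group labels and function names are hypothetical, and this gap is only one of many possible bias metrics.

```python
from collections import defaultdict

def positive_rate_by_group(records, positive_label="positive"):
    """Return the share of `positive_label` predictions for each group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for group, label in records:
        counts[group][1] += 1
        if label == positive_label:
            counts[group][0] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}

def max_rate_gap(rates):
    """Largest difference in positive rate between any two groups."""
    return max(rates.values()) - min(rates.values())

# Hypothetical (group, predicted label) records.
records = [
    ("EUR", "positive"), ("EUR", "positive"), ("EUR", "negative"), ("EUR", "positive"),
    ("USD", "positive"), ("USD", "negative"), ("USD", "negative"), ("USD", "negative"),
]
rates = positive_rate_by_group(records)
gap = max_rate_gap(rates)
print(rates, gap)
```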

#### Limitations and adapting these text-based datasets to financial tasks

We can list four main limitations:
1.	Limited coverage and context.
2.	Unclear formulation of the stereotypes and power imbalances.
3.	Inconsistent and unrelated perturbations of the protected attribute groups.
4.	Their value for downstream applications is often hard to predict. (We also observed this while fine-tuning the FairBERTa model.)

In financial services applications, we generally do not observe bias explicitly, so it is challenging to find an individual word sequence and replace it with counterfactual alternatives. Below, we investigate some possible approaches to generating counterfactuals in different financial use cases.

We also need to consider whether a bias towards a currency, country or geography is a valuable pattern to keep for the financial use case. For example, if EUR shows instability and negative sentiment in the last three months’ financial news, some negative sentiment bias towards EUR might be desirable. However, keeping a long history might have negative consequences, such as failing to reflect a positive outlook for a currency that was historically unstable but has recently shown strong stability.
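
To make this trade-off concrete, the sketch below compares a long-run average sentiment with a recent-window average. The sentiment scores, the `mean_sentiment` helper and the 90-day window (standing in for roughly three months) are all assumptions for illustration.

```python
from datetime import date, timedelta

def mean_sentiment(observations, as_of, window_days=None):
    """Average sentiment, optionally restricted to the last `window_days` days."""
    if window_days is not None:
        cutoff = as_of - timedelta(days=window_days)
        observations = [(d, s) for d, s in observations if d > cutoff]
    if not observations:
        raise ValueError("no observations in the selected window")
    return sum(s for _, s in observations) / len(observations)

# Hypothetical (date, sentiment in [-1, 1]) scores for one currency.
history = [
    (date(2023, 1, 15), -0.8),
    (date(2023, 4, 15), -0.6),
    (date(2023, 10, 15), 0.4),
    (date(2023, 11, 15), 0.6),
]
as_of = date(2023, 12, 1)

long_run = mean_sentiment(history, as_of)
recent = mean_sentiment(history, as_of, window_days=90)
print(round(long_run, 2), round(recent, 2))
```

In this toy series the long-run average is negative while the recent window is positive, which is the situation described above where a long history could hide a recent improvement.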

## Evaluation with Word/Sentence Embeddings

We can also use embedding-based metrics to evaluate the model behaviour.

```md
From StackOverflow: <https://stackoverflow.com/questions/76926025/sentence-embeddings-from-llama-2-huggingface-opensource>

You need to check if the produced sentence embeddings are meaningful, this is required because the model you are using wasn't trained to produce meaningful sentence embeddings.
```

The code snippet below applies weighted mean pooling. This method computes a weighted average of the token embeddings, where each token contributes to the output according to its assigned weight. Here the weight grows linearly with the token position, and padding tokens receive a weight of zero.


In [None]:
model_id = "yiyanghkust/finbert-pretrain"

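# Load the base encoder; no classification head is needed to extract hidden states.
m = AutoModel.from_pretrained(model_id)
t = AutoTokenizer.from_pretrained(model_id)
# Make sure a padding token is defined so that batches can be padded.
if t.pad_token is None:
    t.add_special_tokens({'pad_token': '[PAD]'})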
m.eval()

In [None]:
texts = ['growth is strong and we have plenty of liquidity.', 
         'there is a shortage of capital, and we need extra financing.'
]
t_input = t(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    last_hidden_state = m(**t_input, output_hidden_states=True).hidden_states[-1]


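# Position weights 1..seq_len; multiplying by the attention mask zeroes out padding positions.
weights_for_non_padding = t_input.attention_mask * torch.arange(start=1, end=last_hidden_state.shape[1] + 1).unsqueeze(0)

# Weighted sum of the token embeddings, normalised by the sum of the weights.
# Despite its name, num_of_none_padding_tokens holds the sum of the weights, not a token count.
sum_embeddings = torch.sum(last_hidden_state * weights_for_non_padding.unsqueeze(-1), dim=1)
num_of_none_padding_tokens = torch.sum(weights_for_non_padding, dim=-1).unsqueeze(-1)
sentence_embeddings = sum_embeddings / num_of_none_padding_tokens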

print(t_input.input_ids)
print(weights_for_non_padding)
print(num_of_none_padding_tokens)
print(sentence_embeddings.shape)

In [None]:
cosine_similarity(sentence_embeddings[0].unsqueeze(0), sentence_embeddings[1].unsqueeze(0))
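
To make the pooling arithmetic easier to follow, here is a pure-Python sketch of the same position-weighted average on toy vectors, followed by a cosine similarity. The function names and the toy "hidden states" are hypothetical; the weights 1, 2, 3, ... on non-padding positions mirror the tensor code above.

```python
import math

def position_weighted_mean(token_vectors, attention_mask):
    """Pool token vectors with weights 1, 2, 3, ... on non-padding positions."""
    weights = [m * (i + 1) for i, m in enumerate(attention_mask)]
    total = sum(weights)
    dim = len(token_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, token_vectors)) / total
        for d in range(dim)
    ]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-dimensional "hidden states" for three positions; the last one is padding.
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = position_weighted_mean(tokens, mask)
print(pooled)
print(cosine(pooled, [1.0, 2.0]))
```

The padding vector has no effect on the result because its weight is zero, and the second token counts twice as much as the first.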

In [None]:
from angle_emb import AnglE, Prompts
# Alias the import so that it does not shadow sklearn's cosine_similarity used above.
from angle_emb.utils import cosine_similarity as angle_cosine_similarity


angle = AnglE.from_pretrained('yiyanghkust/finbert-tone', pooling_strategy='cls').cuda()
# For retrieval tasks, `Prompts.C` is the query prompt intended for UAE-Large-V1 (no prompt is needed for documents); note that the model loaded here is finbert-tone.
# When a prompt is specified, the inputs should be a list of dicts with the key 'text'.
qv = angle.encode({'text': 'growth is strong and we have plenty of liquidity.'}, to_numpy=True, prompt=Prompts.C)
doc_vecs = angle.encode(texts, to_numpy=True)

for dv in doc_vecs:
    print(angle_cosine_similarity(qv[0], dv))


Another approach is to use a specific prompt to generate a contextualized embedding of the last token. Jiang et al. introduced this method and demonstrated good results with the OPT model family without finetuning. The technique involves prompting the model to predict a single word. They named it PromptEOL and used the following implementation in their experiments:

"This sentence: {text} means in one word:"

In [None]:
texts = [
    "this is a test",
    "this is another test case with a different length",
]
prompt_template = "This sentence: {text} means in one word:"
texts = [prompt_template.format(text=x) for x in texts]

t_input = t(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    last_hidden_state = m(**t_input, output_hidden_states=True, return_dict=True).hidden_states[-1]
  
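# Index of the last non-padding token in each row; this assumes right padding.
# Note: a BERT-style tokenizer appends [SEP], so with this model the selected token is [SEP] rather than the final ":" of the prompt.
idx_of_the_last_non_padding_token = t_input.attention_mask.bool().sum(1)-1
sentence_embeddings = last_hidden_state[torch.arange(last_hidden_state.shape[0]), idx_of_the_last_non_padding_token]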

print(idx_of_the_last_non_padding_token)
print(sentence_embeddings.shape)

#### Limitations and adapting them to financial tasks

#### Next Steps and Open Questions

## Mitigation

### Data Augmentation and Balancing
-	The aim is to neutralise bias by adding, removing or modifying training samples so that underrepresented sensitive attributes become balanced in the data distribution.
-	A good solution combines multiple data pre-processing techniques: augmentation, filtering and balancing. For example, if the dataset includes “He worked as an engineer.”, “All women are @&!” and “I am a European author.”:
    - **Augment:** “She worked as an engineer.” (So that gender is not associated with occupation.)
    - **Filter:** “All women are @&!” (Eliminate toxicity.)
    - **Balance:** “I am an African author.” (Reweight majority and minority instances.)
-	Using counterfactuals is one of the most popular approaches in the data augmentation process.
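
The three pre-processing steps can be sketched on the toy dataset from the example above. Everything in this snippet is illustrative: the swap table and blocklist are tiny placeholders, and balancing is simplified to adding a counterpart sentence rather than reweighting instances.

```python
import re

SWAPS = {"he": "she", "she": "he"}  # illustrative, not exhaustive
BLOCKLIST = {"@&!"}                 # placeholder for a real toxicity filter

def augment(sentence):
    """Return a gender-swapped copy of the sentence, preserving capitalisation."""
    def swap(match):
        word = match.group(0)
        repl = SWAPS.get(word.lower())
        if repl is None:
            return word
        return repl.capitalize() if word[0].isupper() else repl
    return re.sub(r"\b\w+\b", swap, sentence)

def keep(sentence):
    """Filter step: drop sentences that contain a blocklisted term."""
    return not any(term in sentence for term in BLOCKLIST)

def balance(samples, source, target):
    """Balance step, simplified to adding a counterpart for each `source` sample."""
    return samples + [s.replace(source, target) for s in samples if source in s]

data = ["He worked as an engineer.", "All women are @&!", "I am a European author."]

filtered = [s for s in data if keep(s)]
augmented = filtered + [augment(s) for s in filtered if augment(s) != s]
balanced = balance(augmented, "a European", "an African")
print(balanced)
```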


### How can we use these strategies in a finance-NLP task?

One way to improve fairness is by introducing counterfactual inputs to reduce the impact of protected attributes on the classification decision. For example, if the currency "EUR" biases the model towards a "positive" prediction, we can generate more samples with various currencies. Consider the following sample:

Original sentence: "For the last quarter of 2010, Componenta's net sales doubled to EUR131m from EUR76m for the same period a year earlier, while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m."
Sentiment: Positive

If all sentences with the EUR currency are labeled as positive, the model might incorrectly associate the occurrence of EUR with positivity. To mitigate this issue, we can introduce the same dataset instance with different currencies from around the world.

In [41]:
sentence = "For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m ."
vocab_path =  "./utils/codes-all.csv"
target = "AlphabeticCode"
example_cf = generate_random_counterfactual(sentence, vocab_path, target)
example_cf

'For the last quarter of 2010 Componenta s net sales doubled to xof 131 m from mxn 76 m for the same period a year earlier while it moved to a zero pre tax profit from a pre tax loss of inr 7 m'
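
For readers without access to the `counterfactual_generator` utility, the following is a rough sketch of how a currency swap of this kind could be implemented. It is not the notebook's helper: `swap_currency_codes` and the short code list are hypothetical, and unlike the output above it keeps the original casing and punctuation. Each occurrence is replaced independently, as in the output shown.

```python
import random
import re

# Small illustrative vocabulary; the notebook's helper reads codes from a CSV instead.
CURRENCY_CODES = ["EUR", "USD", "GBP", "JPY", "INR", "MXN"]

def swap_currency_codes(sentence, codes=CURRENCY_CODES, seed=None):
    """Replace each known currency code with a different, randomly chosen one."""
    rng = random.Random(seed)
    # Match a code at a word start when it is followed by a digit or a word boundary.
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, codes)) + r")(?=\d|\b)")

    def replace(match):
        alternatives = [c for c in codes if c != match.group(0)]
        return rng.choice(alternatives)

    return pattern.sub(replace, sentence)

sentence = "Net sales doubled to EUR131m from EUR76m."
counterfactual = swap_currency_codes(sentence, seed=0)
print(counterfactual)
```

Passing a `seed` makes the sampled counterfactual reproducible, which helps when the augmented dataset needs to be regenerated.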

#### Limitations and adapting them to financial tasks

#### Next Steps and Open Questions

### Projection-based Mitigation

The intuition behind this method is to find the subspace of the latent representation that corresponds to the sensitive attributes and to transform the representations in a direction that reduces the bias.

- Ref 1: https://arxiv.org/pdf/2111.00526 (Using Embeddings in Financial Sentiment)
- Ref 2: https://github.com/shauli-ravfogel/nullspace_projection (Use this mitigation approach. It utilises Sentence BERT embeddings. We can use https://huggingface.co/sentence-transformers)
- Alt-Ref 2: https://github.com/technion-cs-nlp/igbp_nonlinear-removal
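
As a minimal sketch of the underlying idea, the snippet below removes a single linear direction from an embedding by orthogonal projection. It assumes the bias direction is already known; the referenced methods estimate such directions from data, so this is only the projection step with a hypothetical `remove_direction` helper and toy vectors.

```python
import math

def remove_direction(vector, bias_direction):
    """Project `vector` onto the subspace orthogonal to `bias_direction`."""
    norm = math.sqrt(sum(b * b for b in bias_direction))
    if norm == 0:
        raise ValueError("bias_direction must be non-zero")
    unit = [b / norm for b in bias_direction]
    dot = sum(v * u for v, u in zip(vector, unit))
    return [v - dot * u for v, u in zip(vector, unit)]

# Toy 3-dimensional embedding and a hypothetical bias direction.
embedding = [3.0, 4.0, 1.0]
bias_direction = [0.0, 2.0, 0.0]

debiased = remove_direction(embedding, bias_direction)
print(debiased)
```

After the projection, the embedding has no component along the chosen direction, while the remaining coordinates are unchanged.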

#### Limitations and adapting them to financial tasks

#### Next Steps and Open Questions

## Discussion

Existing approaches:
-	Ambiguity: They are good for understanding bias when the sensitive attribute appears explicitly in a sentence.
-	Validity: Testing against the existing datasets and methods often lacks a comprehensive review of bias.
-	Limited generalizability: The diversity covered depends on the dataset creation methodology.

In NLP-based financial applications, fairness cannot always be measured by generative performance. The type of algorithm can also increase the complexity of assessing fairness:

-	In graph-based modelling, defining the impact and influence of nodes fairly is a challenging task.


#### Next Steps and Open Questions

Can we generate counterfactuals at the sentence level? Changing the whole structure and wording could also be useful for understanding implicit bias.