# Use Case: Monitoring Bias in Financial Sentiment Analysis Task

In this notebook, we take a developer's point of view: we build a financial sentiment analysis pipeline and explore ways to monitor and mitigate its biases using FAID. The steps are:

1. Use a pre-trained model: FinBERT. Save the model config and sample data for future reference.
2. Evaluate the classification fairness using fairlearn.
3. Bias mitigation with data-augmentation (counterfactuals).

# 1. Use pre-trained FinBERT

FinBERT is one of the early applications of general-capability transformer-based language models (BERT, GPT, etc.) in the financial domain. It is still relevant and used by practitioners and researchers. We will download the model from <https://huggingface.co/yiyanghkust/finbert-tone>

In [1]:
# tested in transformers==4.18.0 
from transformers import AutoTokenizer, BertTokenizer, AutoModel, BertForSequenceClassification, BertConfig, pipeline, utils
from tqdm import tqdm
import torch
from faid.faidlog import faidlog

project_name = "financial-sentiment-analysis"

The model is fine-tuned on 10,000 manually annotated sentences from analyst reports of S&P 500 firms.

**Input**: A financial text.
**Output**: Positive, Neutral or Negative.

In [13]:
finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone',num_labels=3, output_attentions=True)
#model = AutoModel.from_pretrained('yiyanghkust/finbert-tone',num_labels=3, output_attentions=True)
model = AutoModel.from_pretrained('yiyanghkust/finbert-tone',num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')
atokenizer = AutoTokenizer.from_pretrained('yiyanghkust/finbert-tone')

config = BertConfig.from_pretrained('yiyanghkust/finbert-tone')



Initiate a new FAID project to record this bias evaluation and mitigation experiment. You can use this logged metadata to create/fill four report types:
1. **Data Card:** Compatible with Croissant and Google Datacard
2. **Model Card:** Compatible with TensorFlow Model Card Generator
3. **Fairness Report:** A unique report for your use case
4. **Risk Register:** RAID-type reporting. Can be integrated with GitHub Actions.

In [14]:
# Start the log YAML file, you can see the output in the log folder in the root directory
faidlog.init("financial-sentiment-analysis", model=model)
# alternatively you can pass the config file as a dictionary
# faidlog.init("financial-sentiment-analysis", config=config.to_dict())
# alternatively, if your config is custom, you can pass it along with the model
# faidlog.init("financial-sentiment-analysis", model=model, config=config)

In [17]:
# Let's test the model
sentence = 'The company has strong growth prospects.'
sentences = ['growth is strong and we have plenty of liquidity.', 
               'there is a shortage of capital, and we need extra financing.',
              'formulation patents might protect Vasotec to a limited extent.']

pipe = pipeline("text-classification", model=finbert, tokenizer=tokenizer)
results = pipe(sentences)

# Collect the text, predicted label and score of each sample, keyed by its index
results_log = {}
for i, result in enumerate(results):
    results_log[i] = {
        'text': sentences[i],
        'label': result['label'],
        'score': result['score'],
    }


Now we can log this information for future use in our fairness report.

In [18]:
faidlog.log(results_log, "sample_results", add_to_fairness_report=True)
results_log

{0: {'text': 'growth is strong and we have plenty of liquidity.',
  'label': 'Positive',
  'score': 1.0},
 1: {'text': 'there is a shortage of capital, and we need extra financing.',
  'label': 'Negative',
  'score': 0.9952379465103149},
 2: {'text': 'formulation patents might protect Vasotec to a limited extent.',
  'label': 'Neutral',
  'score': 0.9979718327522278}}

In [8]:
print("The config has the following labels:" + str(config.id2label))
encoded_input = tokenizer(sentence, padding=True, return_tensors='pt')
output = finbert(**encoded_input)
probs = torch.softmax(output['logits'], dim=1)
label = config.id2label[torch.argmax(probs).item()]
label

The config has the following labels:{0: 'Neutral', 1: 'Positive', 2: 'Negative'}


'Positive'
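
The cell above turns the model's logits into a label with a softmax followed by an argmax. As a minimal stand-alone sketch of that step, the snippet below applies the same logic to made-up logits (they are illustrative values, not FinBERT outputs) and uses the label order printed by the config above.

```python
import math

# Label order as printed by the model config above.
id2label = {0: "Neutral", 1: "Positive", 2: "Negative"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for illustration only (an assumption, not model output).
logits = [0.2, 3.1, -1.4]
probs = softmax(logits)

# The predicted class is the index with the highest probability.
pred_id = max(range(len(probs)), key=probs.__getitem__)
label = id2label[pred_id]
print(label, round(probs[pred_id], 3))
```

Note that index 0 is `Neutral` in this config, so any code that assumes a different label order needs an explicit mapping.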

The model was trained on three financial text datasets: **(1)** Corporate Reports 10-K & 10-Q: 2.5B tokens, **(2)** Earnings Call Transcripts: 1.3B tokens, and **(3)** Analyst Reports: 1.1B tokens. We therefore cannot use them for evaluation. Instead, we will use the Financial PhraseBank dataset.

We will download the data from <https://huggingface.co/datasets/Jean-Baptiste/financial_news_sentiment>.  A more detailed explanation of downloading different finance datasets can be found in our [project home repo: fairness-monitoring](https://github.com/alan-turing-institute/fairness-monitoring/blob/main/notebooks/eda-fin-data.ipynb).

In [9]:
import numpy as np
import pandas as pd
from datasets import Dataset
from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix)

In [12]:
filename = "./data/financialphrasebank.csv"
#DATASET_CONFIG = { "path": filename, "name": "sentiment"}
# LABEL_MAPPING = { 0: "negative", 1: "neutral", 2: "positive"}
TEXT_COLUMN = "text"
TARGET_COLUMN = "sentiment"
raw_data = pd.read_csv(filename, names=[TARGET_COLUMN, TEXT_COLUMN], encoding="utf-8", encoding_errors="replace")
raw_data.head()

Unnamed: 0,sentiment,text
0,neutral,"According to Gran , the company has no plans t..."
1,neutral,Technopolis plans to develop in stages an area...
2,negative,The international electronic industry company ...
3,positive,With the new production plant the company woul...
4,positive,According to the company 's updated strategy f...


In [None]:
# Profile the data
from ydata_profiling import ProfileReport

profile = ProfileReport(raw_data, title="Profiling Report")
profile.to_notebook_iframe()

In [18]:
import json
profile_dict = json.loads(profile.to_json())

faidlog.log(profile_dict, "data_profile", add_to_fairness_report=True)

In [23]:
# Similar to our samples, profile data includes sample data, so we can log it as well
faidlog.log(profile_dict["sample"][0], "sample_results", add_to_fairness_report=True)

Before using this data in the evaluation, let's test generation of the fairness report, just for the sample data.

In [2]:
# generate the fairness report
fairness_samples = faidlog.get(key=faidlog.keys["sample_data_key"], from_fairness_report=True)
fairness_samples

{0: {'text': 'growth is strong and we have plenty of liquidity.',
  'label': 'Positive',
  'score': 1.0},
 1: {'text': 'there is a shortage of capital, and we need extra financing.',
  'label': 'Negative',
  'score': 0.9952379465103149},
 2: {'text': 'formulation patents might protect Vasotec to a limited extent.',
  'label': 'Neutral',
  'score': 0.9979718327522278}}

In [3]:
# map the sentiment key to label and text key to text
fairness_samples.items()

dict_items([(0, {'text': 'growth is strong and we have plenty of liquidity.', 'label': 'Positive', 'score': 1.0}), (1, {'text': 'there is a shortage of capital, and we need extra financing.', 'label': 'Negative', 'score': 0.9952379465103149}), (2, {'text': 'formulation patents might protect Vasotec to a limited extent.', 'label': 'Neutral', 'score': 0.9979718327522278})])

In [6]:
from faid.utils.report.report_utils import generate_fairness_report
generate_fairness_report(sample_data=fairness_samples, output_file="fairness_report.html")

In [25]:
profile_dict.keys()
profile_dict["variables"].keys()

dict_keys(['sentiment', 'text'])

In [26]:
# The profiling revealed duplicates in the data; remove them and run the profiling again
raw_data.drop_duplicates(subset=["text"], inplace=True)

In [27]:
# The lowest number of samples in a class is 604, so we will balance the data by sampling 604 samples from each class
X_eval_balanced = (raw_data
          .groupby('sentiment', group_keys=False)
          .apply(lambda x: x.sample(n=604, random_state=10, replace=True)))

eval_data = Dataset.from_pandas(X_eval_balanced)
X_eval_balanced.sentiment.value_counts()



sentiment
negative    604
neutral     604
positive    604
Name: count, dtype: int64
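
The balancing cell above samples with `replace=True`, which can draw the same row more than once. As a hedged alternative, the sketch below downsamples every class to the size of the smallest class without replacement. The helper `downsample_balanced` and the toy rows are hypothetical and only stand in for the de-duplicated data frame.

```python
import random
from collections import Counter, defaultdict


def downsample_balanced(rows, label_key="sentiment", seed=10):
    """Keep the same number of rows per class, sampling without replacement."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    # The smallest class determines how many rows every class keeps.
    n = min(len(group) for group in by_label.values())
    rng = random.Random(seed)

    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n))
    return balanced


# Toy rows standing in for the real data (illustrative only).
rows = (
    [{"sentiment": "neutral", "text": f"neutral {i}"} for i in range(6)]
    + [{"sentiment": "positive", "text": f"positive {i}"} for i in range(4)]
    + [{"sentiment": "negative", "text": f"negative {i}"} for i in range(2)]
)

balanced = downsample_balanced(rows)
counts = Counter(row["sentiment"] for row in balanced)
print(counts)
```

Sampling without replacement keeps every evaluation example unique, at the cost of requiring each class to contain at least as many rows as the target size.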

In [28]:
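# FinBERT's config uses capitalised labels ({0: 'Neutral', 1: 'Positive', 2: 'Negative'}),
# so lower-case them to match the dataset labels.
TARGET_STR_INT = {label.lower(): idx for label, idx in config.label2id.items()} # {'neutral': 0, 'positive': 1, 'negative': 2}
TARGET_INT_STR = {idx: label.lower() for idx, label in config.id2label.items()} # {0: 'neutral', 1: 'positive', 2: 'negative'}

def evaluate(y_true, y_pred):
    def map_func(x):
        # Unknown labels (e.g. "none") fall back to neutral
        return TARGET_STR_INT.get(x, TARGET_STR_INT['neutral'])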
    
    y_true = np.vectorize(map_func)(y_true)
    y_pred = np.vectorize(map_func)(y_pred)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f'Accuracy: {accuracy:.3f}')
    
    # Generate accuracy report
    unique_labels = set(y_true)  # Get unique labels
    
    for label in unique_labels:
        label_indices = [i for i in range(len(y_true)) 
                         if y_true[i] == label]
        label_y_true = [y_true[i] for i in label_indices]
        label_y_pred = [y_pred[i] for i in label_indices]
        accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {label}: {accuracy:.3f}')
        
    # Generate classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred)
    print('\nClassification Report:')
    print(class_report)
    
    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1, 2])
    print('\nConfusion Matrix:')
    print(conf_matrix)
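
The per-label accuracy printed by `evaluate` restricts both arrays to the rows whose true label is the given class, which makes it the same quantity as the per-class recall in the classification report. The following sketch shows this on toy labels; `per_label_recall` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def per_label_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in total}


# Toy labels for illustration only.
y_true_toy = ["positive", "positive", "neutral", "negative", "negative", "negative"]
y_pred_toy = ["positive", "neutral", "neutral", "negative", "negative", "positive"]

recall = per_label_recall(y_true_toy, y_pred_toy)
for label, value in recall.items():
    print(f"{label}: {value:.3f}")
```

Reading the number as recall helps when comparing classes: a low value means the model misses many examples of that class, regardless of how often it predicts the class elsewhere.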

In [30]:
def predict(X_test):
    y_pred = []
    for i in tqdm(range(len(X_test))):
        prompt = X_test.iloc[i].text
        result = pipe(prompt)
        answer = result[0]['label'].lower()
        if "positive" in answer:
            y_pred.append("positive")
        elif "negative" in answer:
            y_pred.append("negative")
        elif "neutral" in answer:
            y_pred.append("neutral")
        else:
            y_pred.append("none")
    return y_pred

In [None]:
# reminder (FinBERT config order): 'neutral': 0, 'positive': 1, 'negative': 2
y_pred = predict(X_eval_balanced)
y_true = X_eval_balanced.sentiment.values
evaluate(y_true, y_pred)

# Save the predictions with prompt to a CSV file
X_eval_balanced['predicted_sentiment'] = y_pred
X_eval_balanced.to_csv('./data/output/finbert_predictions_balanced.csv', index=False)

In [None]:
# and the unbalanced data
y_pred = predict(raw_data)
y_true = raw_data.sentiment.values
evaluate(y_true, y_pred)

# Save the predictions with prompt to a CSV file
raw_data_with_predictions = raw_data.copy()
raw_data_with_predictions['predicted_sentiment'] = y_pred
raw_data_with_predictions.to_csv('./data/output/finbert_predictions_unbalanced.csv', index=False)

# 2. Further Evaluation of Bias

Evaluating bias in financial sentiment analysis is challenging due to the complex and nuanced nature of financial language. For example, a news statement often includes domain-specific jargon, idioms, and context-dependent expressions that can vary significantly across different sources and regions. Additionally, financial texts may inherently reflect the perspectives and biases of their authors. 

In our analysis, the dataset consists of fairly straightforward statements. However, we face another challenge which is the ambiguity in defining protected attributes. The financial documents or news rarely contain explicit demographic information, making it challenging to identify and analyze biases against specific groups. 

The lack of standardized benchmarks for measuring bias in this domain complicates the evaluation process, as traditional bias detection methods may not be directly applicable or sufficient. These challenges necessitate the development of specialized tools and methodologies to accurately identify and address bias in financial sentiment analysis, ensuring fair and reliable outcomes.

In this notebook, we will use Captum, a model interpretability library for PyTorch, to compute integrated-gradients attributions and understand which words or word combinations affected the misclassified cases.
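
The attribution cells that follow rely on Captum's `LayerIntegratedGradients`. To give an intuition for what integrated gradients compute, here is a minimal sketch on a toy scorer (a sigmoid of a weighted sum, chosen only for illustration). Attributions are the input-minus-baseline difference times the average gradient along the straight path from the baseline to the input, and they should sum to the difference between the model output at the input and at the baseline. The toy weights, inputs and step count are assumptions.

```python
import math

WEIGHTS = (1.5, -2.0, 0.5)  # toy weights, illustrative only


def model(x, w=WEIGHTS):
    """Toy differentiable scorer: sigmoid of a weighted sum."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))


def grad(x, w=WEIGHTS):
    """Analytic gradient of the toy scorer with respect to x."""
    s = model(x, w)
    return [s * (1.0 - s) * wi for wi in w]


def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad(point)):
            total[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

# Completeness check: attributions should sum to model(x) - model(baseline).
delta = sum(attributions) - (model(x) - model(baseline))
print(attributions, delta)
```

The `delta` here plays the same role as the convergence delta requested with `return_convergence_delta=True` in the Captum call: a value close to zero suggests the path integral was approximated well.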

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification, BertConfig

from captum.attr import visualization as viz
from captum.attr import IntegratedGradients, LayerConductance, LayerIntegratedGradients
from captum.attr import configure_interpretable_embedding_layer, remove_interpretable_embedding_layer

import torch

In [None]:
# Keep the inputs on the same device as the model to avoid device-mismatch errors
device = finbert.device

def predict(inputs):
    return finbert(inputs)[0]

ref_token_id = tokenizer.pad_token_id # A token used for generating the reference (baseline) input
sep_token_id = tokenizer.sep_token_id # A token appended to the end of the text
cls_token_id = tokenizer.cls_token_id # A token prepended to the text

def construct_input_ref_pair(text, ref_token_id, sep_token_id, cls_token_id):

    text_ids = tokenizer.encode(text, add_special_tokens=False)
    # construct input token ids
    input_ids = [cls_token_id] + text_ids + [sep_token_id]
    # construct reference token ids 
    ref_input_ids = [cls_token_id] + [ref_token_id] * len(text_ids) + [sep_token_id]

    return torch.tensor([input_ids], device=device), torch.tensor([ref_input_ids], device=device), len(text_ids)

def construct_input_ref_token_type_pair(input_ids, sep_ind=0):
    seq_len = input_ids.size(1)
    token_type_ids = torch.tensor([[0 if i <= sep_ind else 1 for i in range(seq_len)]], device=device)
    ref_token_type_ids = torch.zeros_like(token_type_ids, device=device)# * -1
    return token_type_ids, ref_token_type_ids

def construct_input_ref_pos_id_pair(input_ids):
    seq_length = input_ids.size(1)
    position_ids = torch.arange(seq_length, dtype=torch.long, device=device)
    # we could potentially also use random permutation with `torch.randperm(seq_length, device=device)`
    ref_position_ids = torch.zeros(seq_length, dtype=torch.long, device=device)

    position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
    ref_position_ids = ref_position_ids.unsqueeze(0).expand_as(input_ids)
    return position_ids, ref_position_ids
    
def construct_attention_mask(input_ids):
    return torch.ones_like(input_ids)

def custom_forward(inputs):
    preds = predict(inputs)
    return torch.softmax(preds, dim = 1)[0][0].unsqueeze(-1)

In [None]:
lig = LayerIntegratedGradients(custom_forward, finbert.bert.embeddings)

input_ids, ref_input_ids, sep_id = construct_input_ref_pair(sentence, ref_token_id, sep_token_id, cls_token_id)
token_type_ids, ref_token_type_ids = construct_input_ref_token_type_pair(input_ids, sep_id)
position_ids, ref_position_ids = construct_input_ref_pos_id_pair(input_ids)
attention_mask = construct_attention_mask(input_ids)

indices = input_ids[0].detach().tolist()
all_tokens = tokenizer.convert_ids_to_tokens(indices)

In [None]:
score = predict(input_ids)

print('Sentence: ', sentence)
print('Predicted label: ' + config.id2label[torch.argmax(score[0]).item()] + ', prob ' + config.id2label[0] + ': ' + str(torch.softmax(score, dim = 1)[0][0].detach().numpy()))

In [None]:
def summarize_attributions(attributions):
    attributions = attributions.sum(dim=-1).squeeze(0)
    attributions = attributions / torch.norm(attributions)
    return attributions

In [None]:
attributions, delta = lig.attribute(inputs=input_ids,
                                    baselines=ref_input_ids,
                                    return_convergence_delta=True)

attributions_sum = summarize_attributions(attributions)

In [None]:
# store the sample in a visualization record
score_vis = viz.VisualizationDataRecord(
                        attributions_sum,
                        torch.softmax(score, dim = 1)[0][0],
                        torch.argmax(torch.softmax(score, dim = 1)[0]),
                        1, # Positive Sentiment
                        sentence,
                        attributions_sum.sum(),       
                        all_tokens,
                        delta)

print('\033[1m', 'Visualization For Score', '\033[0m')
viz.visualize_text([score_vis])

## 2.3 Using our prior knowledge to create protected attributes

We explored the data and model capabilities and drawbacks using a variety of libraries. Based on the results and existing literature knowledge, let's identify hidden protected attributes.

In [None]:
# Get the indices of the misclassified examples with "positive" labels
misclassified_pos_indices = [i for i in range(len(y_true)) 
                         if y_true[i] == 'positive' and y_pred[i] != 'positive']

# Get the indices of the misclassified examples with "negative" labels
misclassified_neg_indices = [i for i in range(len(y_true)) 
                         if y_true[i] == 'negative' and y_pred[i] != 'negative']

# Get the indices of the misclassified examples with "neutral" labels
misclassified_neu_indices = [i for i in range(len(y_true)) 
                         if y_true[i] == 'neutral' and y_pred[i] != 'neutral']
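
One hedged way to turn such misclassification lists into candidate protected attributes is to compare the error rate of texts that mention a given term with the error rate of texts that do not. The sketch below does this with a hypothetical helper, `error_rate_by_group`, on toy data. The term "EUR" and the example sentences are assumptions for illustration.

```python
def error_rate_by_group(texts, y_true, y_pred, term):
    """Compare the misclassification rate of texts with and without `term`."""
    stats = {True: [0, 0], False: [0, 0]}  # group -> [errors, total]
    for text, t, p in zip(texts, y_true, y_pred):
        group = term.lower() in text.lower()
        stats[group][0] += int(t != p)
        stats[group][1] += 1
    return {
        group: (errors / total if total else None)
        for group, (errors, total) in stats.items()
    }


# Toy data for illustration only.
texts = [
    "Net sales rose to EUR 10m",
    "Profit fell to EUR 2m",
    "The company opened a new plant",
    "Orders declined in the quarter",
]
y_true_toy = ["positive", "negative", "neutral", "negative"]
y_pred_toy = ["positive", "positive", "neutral", "negative"]

rates = error_rate_by_group(texts, y_true_toy, y_pred_toy, term="EUR")
print(rates)
```

A large gap between the two rates flags the term as worth a closer look. The substring match is deliberately simple and would also match words such as "Europe", so a tokenised match may be preferable on real data. fairlearn's `MetricFrame` offers a similar per-group breakdown once a group column exists.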

# 3. Mitigating Bias with Data Augmentation

In the analysis, "" and "" emerged as potential protected attributes in the training process. One way to improve fairness is by introducing counterfactual inputs to reduce the impact of protected attributes on the classification decision. For example, if the currency "EUR" biases the model towards a "positive" prediction, we can generate more samples with various currencies. For instance:

Original sentence: "For the last quarter of 2010, Componenta's net sales doubled to EUR131m from EUR76m for the same period a year earlier, while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m."
Sentiment: Positive

If all sentences with the EUR currency are labeled as positive, the model might incorrectly associate the occurrence of EUR with positivity. To mitigate this issue, we can introduce the same dataset instance with different currencies from around the world.
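
To make the idea concrete, the sketch below generates currency counterfactuals by replacing every occurrence of one currency code with another, one target currency per counterfactual. The helpers `swap_currency` and `currency_counterfactuals` and the short code list are hypothetical and written for illustration. They are not the FAID `CounterfactualGenerator` used in the cells that follow.

```python
import re

# Small illustrative vocabulary (an assumption for this sketch).
CURRENCY_CODES = ["EUR", "USD", "GBP", "JPY"]


def swap_currency(sentence, source, target):
    """Replace `source` with `target` wherever it directly precedes an amount."""
    pattern = rf"\b{re.escape(source)}(?=\s?\d)"
    return re.sub(pattern, target, sentence)


def currency_counterfactuals(sentence, source, codes=CURRENCY_CODES):
    """Return one counterfactual per alternative currency code."""
    return [swap_currency(sentence, source, code) for code in codes if code != source]


original = (
    "For the last quarter of 2010, Componenta's net sales doubled to EUR131m "
    "from EUR76m for the same period a year earlier, while it moved to a zero "
    "pre-tax profit from a pre-tax loss of EUR7m."
)

cfs = currency_counterfactuals(original, "EUR")
for cf in cfs:
    print(cf)
```

Using a single target currency per counterfactual keeps the amounts comparable within the sentence, so the original sentiment label can reasonably be carried over to each generated sample.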


In [None]:
from faid.mitigation.counterfactual.counterfactual_generator import CounterfactualGenerator

In [None]:
sentence = "For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m ."
vocab_path =  "data/codes-all.csv"
df = pd.read_csv(vocab_path)
vocab_code = df["AlphabeticCode"].values
cf_generator_code = CounterfactualGenerator(vocab_code)

In [None]:
example_cf = cf_generator_code.generate_random_counterfactual(sentence)
example_cf

In [None]:
# Now that the example counterfactual is generated, we can use the pipeline to predict its sentiment
# Note that the counterfactual is almost meaningless: it mixes three different currencies, so it is unclear
# whether the change is an increase or a decrease. The overall statement still reads as positive.
print(pipe(sentence))
print(pipe(example_cf))

In [None]:
sentence = "According to Gran , the company has no plans to move all production to Germany, although that is where the company is growing ."

vocab_ent= df["Entity"].values
cf_generator_ent = CounterfactualGenerator(vocab_ent)
example_cf = cf_generator_ent.generate_random_counterfactual(sentence)
example_cf

In [None]:
print(pipe(sentence))
print(pipe(example_cf))
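
Comparing single pairs by eye does not scale, so one possible summary is the share of original/counterfactual pairs whose predicted label changes. The sketch below computes this with a hypothetical `flip_rate` helper and a deliberately biased stand-in classifier. Neither is part of FAID, and the pairs are toy examples.

```python
def flip_rate(classify, pairs):
    """Share of (original, counterfactual) pairs whose predicted label differs."""
    if not pairs:
        return 0.0
    flips = sum(classify(original) != classify(cf) for original, cf in pairs)
    return flips / len(pairs)


def toy_classifier(text):
    """Deliberately biased stand-in: treats any mention of EUR as positive."""
    return "positive" if "EUR" in text else "neutral"


# Toy pairs for illustration only.
pairs = [
    ("Sales reached EUR10m.", "Sales reached USD10m."),
    ("Sales reached EUR5m.", "Sales reached GBP5m."),
    ("The plant opened in May.", "The plant opened in June."),
    ("Costs were EUR3m.", "Costs were EUR3m."),
]

rate = flip_rate(toy_classifier, pairs)
print(rate)
```

With the notebook's pipeline, `classify` could be a small wrapper such as `lambda text: pipe(text)[0]['label']`. A rate near zero would suggest the substituted attribute has little influence on the prediction.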

In [None]:
vocab_path =  "data/codes-all.csv"
target = "Entity"

# Save counterfactuals in a new dataframe with the sentiment

sents = []
cfarr = []

#for i in range(len(X_train)):
for i in range(1):
    sentiment = raw_data.iloc[i]['sentiment']
    cfs = cf_generator_ent.generate_counterfactuals(raw_data.iloc[i]['text'], vocab_path, target)
    for cf in cfs:
        sents.append(sentiment)
        cfarr.append(cf)

cf_df = pd.DataFrame({'sentiment': sents, 'text': cfarr})

# Save it to file
cf_df.to_csv('./data/output/counterfactual/financialphrasebank_cfs.csv', index=False)