# Use Case: Monitoring Bias in Financial Sentiment Analysis Task

In this notebook, we walk through, from a developer's point of view, the process of building a financial sentiment analysis pipeline and explore ways to monitor and mitigate bias in that pipeline using FAID.

1. Use a pre-trained model: FinBERT. Save the model config and sample data for future reference.
2. Evaluate classification fairness using fairlearn. Save the fairness metrics and other information related to the fairness experiments to metadata. Generate an HTML report.
3. Mitigate bias with data augmentation (counterfactuals). Update the metadata with the bias mitigation. Update the existing Model Card with the results.

# 1. Project Configuration

In this project, we will use FinBERT. It is one of the early applications of general-capability transformer-based language models (BERT, GPT, etc.) in the financial domain. It is still relevant and used by practitioners and researchers. We will download the model from <https://huggingface.co/yiyanghkust/finbert-tone>

In [1]:
import json
import numpy as np
import pandas as pd
from datasets import Dataset
from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix)
from transformers import AutoTokenizer, BertTokenizer, AutoModel, BertForSequenceClassification, BertConfig, pipeline
from tqdm import tqdm
import torch

In [2]:
import sys
sys.path.append('../../')
from faid import logging as faidlog
experiment_name = "financial-sentiment-analysis-finbert-fairness"
faidlog.init_log()

Model log file already exists. Logging will be appended to the existing file.
Data log file already exists. Logging will be appended to the existing file.
Risks log file already exists. Logging will be appended to the existing file.
Transparency log file already exists. Logging will be appended to the existing file.


In [None]:
ctx = faidlog.FairnessExperimentRecord(name=experiment_name)

A `FairnessExperimentRecord` object logs fairness-related results on request. When you start a new notebook, you can retrieve this context via:

```python
ctx = faidlog.get_exp_ctx(experiment_name)
```


In [4]:
print(faidlog.get_exp_ctx(experiment_name))

name: financial-sentiment-analysis-finbert-fairness
context: {'authors': [], 'start_time': '', 'description': '', 'tags': ['financial-sentiment-analysis', 'sentiment-analysis'], 'hardware': {}, 'license_info': [['captum', 'BSD-3'], ['datasets', 'Apache 2.0'], ['huggingface-hub', 'Apache'], ['immutabledict', 'MIT'], ['ipykernel', 'BSD 3-Clause License\n\nCopyright (c) 2015, IPython Development Team\n\nAll rights reserved.\n\nRedistribution and use in source and binary forms, with or without\nmodification, are permitted provided that the following conditions are met:\n\n1. Redistributions of source code must retain the above copyright notice, this\n   list of conditions and the following disclaimer.\n\n2. Redistributions in binary form must reproduce the above copyright notice,\n   this list of conditions and the following disclaimer in the documentation\n   and/or other materials provided with the distribution.\n\n3. Neither the name of the copyright holder nor the names of its\n   cont

`faidlog` also has logging features that help you create fairness and transparency reports. For example, you can get the imported libraries and the licenses of the installed packages:

In [5]:
#print(faidlog.get_imported_libraries())

In [6]:
#print(faidlog.get_package_licenses())

In [7]:
ctx.add_context_entry("license_info", faidlog.get_package_licenses())

Added license_info to project metadata under ['context'] and log updated
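To make the license logging step more concrete, here is a minimal sketch of how package license strings can be gathered with the standard library's `importlib.metadata`. The helper name `collect_package_licenses` is hypothetical, and this is not necessarily how `faidlog.get_package_licenses()` is implemented; it only illustrates where such `(package, license)` pairs can come from.

```python
from importlib import metadata


def collect_package_licenses(limit=None):
    """Return (package name, license string) pairs for installed distributions."""
    licenses = []
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if not name:
            continue  # skip distributions with unreadable metadata
        # "License" is a free-text field: it may be missing or hold the full license text
        licenses.append((name, meta.get("License") or "UNKNOWN"))
    licenses.sort(key=lambda pair: pair[0].lower())
    return licenses[:limit] if limit else licenses


for name, license_text in collect_package_licenses(limit=5):
    # Print only the first line, since some packages store the whole license text
    print(name, "->", license_text.splitlines()[0][:40])
```

Because the field is free text, some entries are short identifiers while others contain an entire license, which matches the mixed entries visible in the logged `license_info`.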


In [8]:
# You can get all the available context information by calling the `to_dict` method
print(ctx.to_dict())

{'name': 'financial-sentiment-analysis-finbert-fairness', 'context': {'authors': [], 'start_time': '', 'description': '', 'tags': ['financial-sentiment-analysis', 'sentiment-analysis'], 'hardware': {}, 'license_info': [('captum', 'BSD-3'), ('datasets', 'Apache 2.0'), ('huggingface-hub', 'Apache'), ('immutabledict', 'MIT'), ('ipykernel', 'BSD 3-Clause License\n\nCopyright (c) 2015, IPython Development Team\n\nAll rights reserved.\n\nRedistribution and use in source and binary forms, with or without\nmodification, are permitted provided that the following conditions are met:\n\n1. Redistributions of source code must retain the above copyright notice, this\n   list of conditions and the following disclaimer.\n\n2. Redistributions in binary form must reproduce the above copyright notice,\n   this list of conditions and the following disclaimer in the documentation\n   and/or other materials provided with the distribution.\n\n3. Neither the name of the copyright holder nor the names of its\

In [9]:
tags = ["financial-sentiment-analysis","sentiment-analysis"]
ctx.add_context_entry("tags", tags)

Added tags to project metadata under ['context'] and log updated


In [10]:
from faid.report import generate_experiment_overview_report
generate_experiment_overview_report(ctx.to_dict())

The model is fine-tuned on 10,000 manually annotated sentences from analyst reports of S&P 500 firms.

**Input**: A financial text.
**Output**: Positive, Neutral or Negative.

In [11]:
model_name = 'yiyanghkust/finbert-tone'

In [12]:
from faid.scan import get_fairness_score

helm_fairness = get_fairness_score(model_name, html=True)
display(helm_fairness)

Could not find a fairness score for yiyanghkust/finbert-tone. Check https://crfm.stanford.edu/helm/ (Stanford HELM Leaderboard) for a list of models


None

In [13]:
from faid.scan import fairness_benchmark_dropdown
fairness_benchmark_dropdown()



In [14]:
finbert = BertForSequenceClassification.from_pretrained(model_name,num_labels=3, output_attentions=True)
#model = AutoModel.from_pretrained(model_name,num_labels=3, output_attentions=True)
model = AutoModel.from_pretrained(model_name,num_labels=3)
tokenizer = BertTokenizer.from_pretrained(model_name)
atokenizer = AutoTokenizer.from_pretrained(model_name)

config = BertConfig.from_pretrained(model_name)



So far, we have only used the basic fairness experiment context capabilities. You can also use FAID to create and fill in four report types that are required to achieve transparency:
1. **Data Card:** Compatible with Croissant and Google Datacard. In this example, we are not creating a new dataset, so we do not use the data card logging utilities. See this example: [Demo using Prism-Alignment Dataset](./demo-datacard.ipynb)
2. **Model Card:** Compatible with Tensorflow Model Card Generator and Huggingface Model Card. In this notebook, we use a Huggingface model and extract the base model card from the Huggingface Hub.
3. **Risk Register:** RAID-type reporting. Report risks, assumptions, issues and dependencies throughout your workflow.
4. **Transparency Report:** You can generate this report based on the UK's Algorithmic Transparency Standard Recording format.

In [15]:
# get model card from huggingface model
from huggingface_hub import ModelCard
card = ModelCard.load('yiyanghkust/finbert-tone')

In [16]:
# You can access model card metadata from huggingface model
card.data

{'language': 'en', 'license': None, 'library_name': None, 'tags': ['financial-sentiment-analysis', 'sentiment-analysis'], 'base_model': None, 'datasets': None, 'metrics': None, 'eval_results': None, 'model_name': None, 'widget': [{'text': 'growth is strong and we have plenty of liquidity'}]}

In [17]:
# You can also access model card text from huggingface model
card.text

'\n`FinBERT` is a BERT model pre-trained on financial communication text. The purpose is to enhance financial NLP research and practice. It is trained on the following three financial communication corpus. The total corpora size is 4.9B tokens.\n- Corporate Reports 10-K & 10-Q: 2.5B tokens\n- Earnings Call Transcripts: 1.3B tokens\n- Analyst Reports: 1.1B tokens\n\nMore technical details on `FinBERT`: [Click Link](https://github.com/yya518/FinBERT)\n\nThis released `finbert-tone` model is the `FinBERT` model fine-tuned on 10,000 manually annotated (positive, negative, neutral) sentences from analyst reports. This model achieves superior performance on financial tone analysis task. If you are simply interested in using `FinBERT` for financial tone analysis, give it a try.\n\nIf you use the model in your academic work, please cite the following paper:\n\nHuang, Allen H., Hui Wang, and Yi Yang. "FinBERT: A Large Language Model for Extracting Information from Financial Text." *Contemporary

In [18]:
# Start the log YAML file, you can see the output in the log folder in the root directory
model_info = faidlog.ModelCard()
model_info.set_model_details({
    "name": model_name,
    # Concatenate the card text and its metadata; str.join would treat the text as a separator
    "overview": card.text + " " + str(card.data.to_dict())
    })
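When combining free text with a metadata dictionary, note that `str.join` uses the string it is called on as the separator between the items of its argument. A minimal sketch, using made-up stand-in values for `card.text` and `card.data.to_dict()`, shows the difference between chaining `join` and explicit concatenation:

```python
import json

# Stand-in values (hypothetical) for `card.text` and `card.data.to_dict()`
card_text = "FinBERT is a BERT model pre-trained on financial text."
card_meta = {"language": "en", "tags": ["sentiment-analysis"]}

# Chained join: the card text acts as a separator and is then discarded
chained = card_text.join(" ").join(card_meta)
print(chained)  # only the dict keys survive

# Explicit concatenation keeps both the text and the metadata
overview = card_text.strip() + "\n\n" + json.dumps(card_meta)
print(overview)
```

Serialising the metadata with `json.dumps` is one option among several; `str()` on the dictionary would also keep the content, at the cost of a less portable format.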

In [19]:
# Alternatively you can use ModelCard class to get the model card
# If the model card is already available in the log directory, it will load the model card from the log directory
model_info = faidlog.ModelCard()
print(model_info.to_dict())

{'model_details': {'name': '', 'overview': None, 'documentation': None, 'owners': [{'name': '', 'contact': ''}], 'version': {'name': None, 'date': None, 'diff': None}, 'license': {'identifier': None, 'custom_text': None}, 'references': None, 'citation': None, 'path': None}, 'model_parameters': {'description': None, 'model_architecture': '', 'data': [{'description': '', 'link': None, 'sensitive': None, 'graphics': None}], 'input_format': None, 'output_format': None, 'output_format_map': None}, 'quantitative_analysis': {'description': None, 'performance_metrics': [{'name': None, 'description': None, 'value': None, 'slice': None, 'confidence_interval': {'description': 'Explain your interval selection methodology. e.g. <https://fairlearn.org/v0.11/user_guide/assessment/confidence_interval_estimation.html>', 'lower_bound': None, 'upper_bound': None}}]}, 'considerations': {'description': None, 'intended_users': '', 'use_cases': '', 'limitations': '', 'tradeoffs': None, 'ethical_consideration

In [20]:
model_info.set_model_parameters(
    {
        "description": "FinBERT is a pre-trained language model for financial sentiment analysis. It is based on the BERT model architecture and is fine-tuned on a large financial news dataset.",
        "num_layers": 12,
        "num_labels": 3,
        "output_attentions": True,
        "output_hidden_states": True,
        "hidden_dropout_prob": 0.1,
        "attention_probs_dropout_prob": 0.1,
    }
)

In [21]:
model_info.save()

Model info saved to the model log file.


In [22]:
faidlog.get_model_entry()

{'model_details': {'name': '',
  'overview': None,
  'documentation': None,
  'owners': [{'name': '', 'contact': ''}],
  'version': {'name': None, 'date': None, 'diff': None},
  'license': {'identifier': None, 'custom_text': None},
  'references': None,
  'citation': None,
  'path': None},
 'model_parameters': {'description': None,
  'model_architecture': '',
  'data': [{'description': '',
    'link': None,
    'sensitive': None,
    'graphics': None}],
  'input_format': None,
  'output_format': None,
  'output_format_map': None},
 'quantitative_analysis': {'description': None,
  'performance_metrics': [{'name': None,
    'description': None,
    'value': None,
    'slice': None,
    'confidence_interval': {'description': 'Explain your interval selection methodology. e.g. <https://fairlearn.org/v0.11/user_guide/assessment/confidence_interval_estimation.html>',
     'lower_bound': None,
     'upper_bound': None}}]},
 'considerations': {'description': None,
  'intended_users': '',
  'use

In [23]:
# When no key is specified, the default key is "model_info",
# so the call below is equivalent to:
# faidlog.get_model_entry("model_info")
faidlog.get_model_entry()

{'model_details': {'name': '',
  'overview': None,
  'documentation': None,
  'owners': [{'name': '', 'contact': ''}],
  'version': {'name': None, 'date': None, 'diff': None},
  'license': {'identifier': None, 'custom_text': None},
  'references': None,
  'citation': None,
  'path': None},
 'model_parameters': {'description': None,
  'model_architecture': '',
  'data': [{'description': '',
    'link': None,
    'sensitive': None,
    'graphics': None}],
  'input_format': None,
  'output_format': None,
  'output_format_map': None},
 'quantitative_analysis': {'description': None,
  'performance_metrics': [{'name': None,
    'description': None,
    'value': None,
    'slice': None,
    'confidence_interval': {'description': 'Explain your interval selection methodology. e.g. <https://fairlearn.org/v0.11/user_guide/assessment/confidence_interval_estimation.html>',
     'lower_bound': None,
     'upper_bound': None}}]},
 'considerations': {'description': None,
  'intended_users': '',
  'use

In [25]:
from faid.report import generate_model_card_report
generate_model_card_report()
# The report will be saved in the reports/ folder in the root directory

In [26]:
faidlog.add_risk_entry({
    "description": "The model card information does not contain the model's training data information.",
    "impact": "High",
    "likelihood": "High",
    "mitigation": "Add the training data information to the model card. You can find it from their research paper.",
}, "risks")

Added risks to risk register


In [27]:
faidlog.add_risk_entry({
    "description": "The model card information does not contain the model's training data information.",
    "impact": "High",
    "status": "TODO",
    "action": "Add the training data information to the model card. You can find it from their research paper.",
}, "issues")

Added issues to risk register


In [28]:
risks = faidlog.get_risk_entry()
risks

{'risks': {0: {'description': '',
   'impact': '',
   'likelihood': '',
   'mitigation': ''},
  2: {'description': "The model card information does not contain the model's training data information.",
   'impact': 'High',
   'likelihood': 'High',
   'mitigation': 'Add the training data information to the model card. You can find it from their research paper.'}},
 'assumptions': {0: {'description': '', 'impact': '', 'action': ''}},
 'issues': {0: {'description': '', 'impact': '', 'status': '', 'action': ''},
  2: {'description': "The model card information does not contain the model's training data information.",
   'impact': 'High',
   'status': 'TODO',
   'action': 'Add the training data information to the model card. You can find it from their research paper.'}},
 'dependencies': {0: {'description': '',
   'impact': '',
   'status': '',
   'action': ''}}}

In [29]:
from faid.report import generate_risk_register_report
generate_risk_register_report()

In [30]:
# If you want to generate all reports at once, you can call the following method
# This function will generate reports for every yml file in the log directory
from faid.report import generate_all_reports
generate_all_reports()

All reports generated


## Sample Sentences from Dataset

Because the model was pre-trained on three financial communication corpora, **(1)** Corporate Reports 10-K & 10-Q: 2.5B tokens, **(2)** Earnings Call Transcripts: 1.3B tokens, and **(3)** Analyst Reports: 1.1B tokens, we cannot use them for evaluation. Instead, we will use the Financial PhraseBank dataset. We will download the data from <https://huggingface.co/datasets/Jean-Baptiste/financial_news_sentiment>. A more detailed explanation of downloading different finance datasets can be found in our [project home repo: fairness-monitoring](https://github.com/alan-turing-institute/fairness-monitoring/blob/main/notebooks/eda-fin-data.ipynb).

In [31]:
# Let's test the model
sentence = 'The company has strong growth prospects.'
sentences = ['growth is strong and we have plenty of liquidity.',
             'there is a shortage of capital, and we need extra financing.',
             'formulation patents might protect Vasotec to a limited extent.']

pipe = pipeline("text-classification", model=finbert, tokenizer=tokenizer)
results = pipe(sentences)

# Collect the text, predicted label and score of each sample, keyed by position
results_log = {
    i: {'text': text, 'label': result['label'], 'score': result['score']}
    for i, (text, result) in enumerate(zip(sentences, results))
}




Now, we can log this information for future use in our fairness report.

In [32]:
ctx.add_data_entry("sample", results_log)
results_log

Added sample to project metadata under ['data'] and log updated


{0: {'text': 'growth is strong and we have plenty of liquidity.',
  'label': 'Positive',
  'score': 1.0},
 1: {'text': 'there is a shortage of capital, and we need extra financing.',
  'label': 'Negative',
  'score': 0.9952379465103149},
 2: {'text': 'formulation patents might protect Vasotec to a limited extent.',
  'label': 'Neutral',
  'score': 0.9979718327522278}}

In [33]:
print("The config has the following labels:" + str(config.id2label))
encoded_input = tokenizer(sentence, padding=True, return_tensors='pt')
output = finbert(**encoded_input)
probs = torch.softmax(output['logits'], dim=1)
label = config.id2label[torch.argmax(probs).item()]
label

The config has the following labels:{0: 'Neutral', 1: 'Positive', 2: 'Negative'}


'Positive'
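The cell above turns raw logits into a label by applying a softmax and taking the arg-max index. A minimal pure-Python sketch of that step follows; the logits are made up for illustration, and the label order is copied from the `config.id2label` output printed above.

```python
import math

# Label order as printed from `config.id2label` in this notebook
id2label = {0: "Neutral", 1: "Positive", 2: "Negative"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for a single sentence, one value per class
logits = [-2.0, 4.5, -1.0]
probs = softmax(logits)

# Index of the highest probability, then mapped to its label name
best = max(range(len(probs)), key=probs.__getitem__)
print(id2label[best], round(probs[best], 3))
```

Note that index 0 is `Neutral` for this model, so any code that assumes a different class order will silently mislabel predictions.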

In [34]:
filename = "./data/financialphrasebank.csv"
#DATASET_CONFIG = { "path": filename, "name": "sentiment"}
# LABEL_MAPPING = { 0: "negative", 1: "neutral", 2: "positive"}
TEXT_COLUMN = "text"
TARGET_COLUMN = "sentiment"
raw_data = pd.read_csv(filename, names=[TARGET_COLUMN, TEXT_COLUMN], encoding="utf-8", encoding_errors="replace")
raw_data.head()

|   | sentiment | text |
|---|-----------|------|
| 0 | neutral | According to Gran , the company has no plans t... |
| 1 | neutral | Technopolis plans to develop in stages an area... |
| 2 | negative | The international electronic industry company ... |
| 3 | positive | With the new production plant the company woul... |
| 4 | positive | According to the company 's updated strategy f... |


In [35]:
# Profile the data
from ydata_profiling import ProfileReport

profile = ProfileReport(raw_data, title="Profiling Report")
profile.to_notebook_iframe()


In [36]:
profile_dict = json.loads(profile.to_json())

profile_dict.keys()


dict_keys(['analysis', 'time_index_analysis', 'table', 'variables', 'scatter', 'correlations', 'missing', 'alerts', 'package', 'sample', 'duplicates'])

In [38]:
print(profile_dict["variables"].keys())
# Include the profile related to sentiment labels in the log
ctx.add_data_entry(key="variable_profile", entry=profile_dict["variables"]["sentiment"]["word_counts"])

dict_keys(['sentiment', 'text'])
Added variable_profile to project metadata under ['data'] and log updated


Before using this data in the evaluation, let's test generation of the fairness report, just with the sample data and variable profile.

In [39]:
from faid.report import generate_experiment_overview_report
generate_experiment_overview_report(ctx.to_dict())


# 2. Predict Sentiment and Evaluate Group Fairness

In [31]:
# In the profiling we found that there are duplicates in the data, remove them and run the profiling again
raw_data.drop_duplicates(subset=["text"], inplace=True)

In [32]:
# The lowest number of samples in a class is 604, so we will balance the data by sampling 604 samples from each class
X_eval_balanced = (raw_data
          .groupby('sentiment', group_keys=False)
          .apply(lambda x: x.sample(n=604, random_state=10, replace=True)))

eval_data = Dataset.from_pandas(X_eval_balanced)
X_eval_balanced.sentiment.value_counts()

sentiment
negative    604
neutral     604
positive    604
Name: count, dtype: int64
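The balancing step above draws the same number of rows from every class. The cell passes `replace=True`, which can repeat sentences within a class; the sketch below shows the same idea without replacement on made-up toy rows, using only the standard library. The function name `balance_by_downsampling` is hypothetical.

```python
import random
from collections import Counter, defaultdict

# Toy rows (made up); the notebook applies the same idea to the PhraseBank DataFrame
rows = [("neutral", f"n{i}") for i in range(6)]
rows += [("positive", f"p{i}") for i in range(4)]
rows += [("negative", f"g{i}") for i in range(2)]


def balance_by_downsampling(rows, seed=10):
    """Sample every class down to the size of the smallest class, without replacement."""
    by_label = defaultdict(list)
    for label, text in rows:
        by_label[label].append(text)
    n = min(len(texts) for texts in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced += [(label, text) for text in rng.sample(by_label[label], n)]
    return balanced


balanced = balance_by_downsampling(rows)
print(Counter(label for label, _ in balanced))
```

Sampling without replacement keeps every evaluation sentence unique, provided each class has at least as many rows as the target size.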

In [33]:
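TARGET_STR_INT = config.label2id  # {'Neutral': 0, 'Positive': 1, 'Negative': 2}
TARGET_INT_STR = config.id2label  # {0: 'Neutral', 1: 'Positive', 2: 'Negative'}
# The dataset labels and the predict() outputs are lowercase, so use a lowercase lookup
LABEL_TO_ID = {label.lower(): idx for label, idx in TARGET_STR_INT.items()}

def evaluate(y_true, y_pred):
    def map_func(x):
        # Unknown labels (e.g. "none") map to -1 rather than silently becoming a valid class
        return LABEL_TO_ID.get(str(x).lower(), -1)
    
    y_true = np.vectorize(map_func)(y_true)
    y_pred = np.vectorize(map_func)(y_pred)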
    
    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)
    print(f'Accuracy: {accuracy:.3f}')
    
    # Generate accuracy report
    unique_labels = set(y_true)  # Get unique labels
    
    for label in unique_labels:
        label_indices = [i for i in range(len(y_true)) 
                         if y_true[i] == label]
        label_y_true = [y_true[i] for i in label_indices]
        label_y_pred = [y_pred[i] for i in label_indices]
        accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {label}: {accuracy:.3f}')
        
    # Generate classification report
    class_report = classification_report(y_true=y_true, y_pred=y_pred)
    print('\nClassification Report:')
    print(class_report)
    
    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=[0, 1, 2])
    print('\nConfusion Matrix:')
    print(conf_matrix)
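Label lookups are case-sensitive, so a mapping keyed by `Positive` does not match a dataset label such as `positive`. If the lookup also has a default value, every label collapses to that default and the accuracy looks perfect. A minimal sketch of the pitfall and a normalised lookup, with made-up labels and the id order taken from the `config.id2label` output printed earlier:

```python
# Inverse of the `config.id2label` mapping printed earlier in this notebook
label2id = {"Neutral": 0, "Positive": 1, "Negative": 2}

# Made-up lowercase labels, as they appear in the dataset and in predict()
y_true = ["positive", "negative", "neutral", "negative"]
y_pred = ["positive", "neutral", "neutral", "positive"]


def accuracy(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Case-sensitive lookup with a default: every lowercase label falls back to 1
naive_true = [label2id.get(label, 1) for label in y_true]
naive_pred = [label2id.get(label, 1) for label in y_pred]
print(naive_true, accuracy(naive_true, naive_pred))

# Normalised lookup that raises KeyError on unknown labels instead of hiding them
lower2id = {name.lower(): idx for name, idx in label2id.items()}
ids_true = [lower2id[label.lower()] for label in y_true]
ids_pred = [lower2id[label.lower()] for label in y_pred]
print(ids_true, accuracy(ids_true, ids_pred))
```

A confusion matrix in which all samples land in a single class is a useful warning sign that the label mapping, rather than the model, is responsible for the score.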

In [34]:
def predict(X_test):
    y_pred = []
    for i in tqdm(range(len(X_test))):
        prompt = X_test.iloc[i].text
        result = pipe(prompt)
        answer = result[0]['label'].lower()
        if "positive" in answer:
            y_pred.append("positive")
        elif "negative" in answer:
            y_pred.append("negative")
        elif "neutral" in answer:
            y_pred.append("neutral")
        else:
            y_pred.append("none")
    return y_pred

In [35]:
# reminder: config.id2label is {0: 'Neutral', 1: 'Positive', 2: 'Negative'}
y_pred = predict(X_eval_balanced)
y_true = X_eval_balanced.sentiment.values
evaluate(y_true, y_pred)

# Save the predictions with prompt to a CSV file
X_eval_balanced['predicted_sentiment'] = y_pred
X_eval_balanced.to_csv('./data/output/finbert_predictions_balanced.csv', index=False)

100%|██████████| 1812/1812 [01:32<00:00, 19.61it/s]

Accuracy: 1.000
Accuracy for label 1: 1.000

Classification Report:
              precision    recall  f1-score   support

           1       1.00      1.00      1.00      1812

    accuracy                           1.00      1812
   macro avg       1.00      1.00      1.00      1812
weighted avg       1.00      1.00      1.00      1812


Confusion Matrix:
[[   0    0    0]
 [   0 1812    0]
 [   0    0    0]]





In [36]:
# and the unbalanced data
y_pred = predict(raw_data)
y_true = raw_data.sentiment.values
evaluate(y_true, y_pred)

# Save the predictions with prompt to a CSV file
raw_data_with_predictions = raw_data.copy()
raw_data_with_predictions['predicted_sentiment'] = y_pred
raw_data_with_predictions.to_csv('./data/output/finbert_predictions_unbalanced.csv', index=False)

100%|██████████| 4838/4838 [03:54<00:00, 20.66it/s]

Accuracy: 1.000
Accuracy for label 1: 1.000

Classification Report:
              precision    recall  f1-score   support

           1       1.00      1.00      1.00      4838

    accuracy                           1.00      4838
   macro avg       1.00      1.00      1.00      4838
weighted avg       1.00      1.00      1.00      4838


Confusion Matrix:
[[   0    0    0]
 [   0 4838    0]
 [   0    0    0]]





In [37]:
# Flag sentences in raw_data["text"] that mention "EUR": 1 in the new column "eur" if present, else 0
raw_data["eur"] = raw_data["text"].apply(lambda x: 1 if "EUR" in x else 0)

y_pred_series = pd.Series(y_pred)
y_pred_binary = y_pred_series.apply(lambda x: 1 if "positive" in x else 0)

y_true_series = pd.Series(y_true)
y_true_binary = y_true_series.apply(lambda x: 1 if "positive" in x else 0)

In [38]:
raw_data["eur"].value_counts()

eur
0    4012
1     826
Name: count, dtype: int64

In [39]:
from fairlearn.metrics import (
    MetricFrame,
    count,
    false_negative_rate,
    false_positive_rate,
    selection_rate,
)

pos_label = TARGET_STR_INT["Positive"]

metrics = {
    "accuracy": accuracy_score,
    "false positive rate": false_positive_rate,
    "false negative rate": false_negative_rate,
    "selection rate": selection_rate,
    "count": count,
}

sensitive_features = raw_data["eur"]

metric_frame = MetricFrame(
    metrics=metrics, 
    y_true=y_true_binary, 
    y_pred=y_pred_binary, 
    sensitive_features=sensitive_features
)



In [40]:
results = metric_frame.by_group.to_dict()
results

{'accuracy': {0: 0.8354935194416749, 1: 0.8704600484261501},
 'false positive rate': {0: 0.05850706119704102, 1: 0.027888446215139442},
 'false negative rate': {0: 0.4682080924855491, 1: 0.28703703703703703},
 'selection rate': {0: 0.18095712861415753, 1: 0.2966101694915254},
 'count': {0: 4012.0, 1: 826.0}}
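To make the by-group numbers easier to interpret, the sketch below computes selection rate and false negative rate per group on a tiny made-up example, then derives the gap between the two groups from the values reported above. The helper `group_rates` is hypothetical and only mirrors the definitions of these metrics; fairlearn's `MetricFrame` also offers a `difference()` method for such gaps.

```python
def group_rates(y_true, y_pred, groups):
    """Selection rate and false negative rate per group for binary labels."""
    out = {}
    for g in sorted(set(groups)):
        idx = [i for i, v in enumerate(groups) if v == g]
        positives = [i for i in idx if y_true[i] == 1]
        out[g] = {
            # Share of samples in the group predicted as positive
            "selection rate": sum(y_pred[i] for i in idx) / len(idx),
            # Share of truly positive samples in the group predicted as not positive
            "false negative rate": (
                sum(1 for i in positives if y_pred[i] == 0) / len(positives)
                if positives else float("nan")
            ),
        }
    return out


# Tiny made-up example: group 1 = sentence mentions "EUR"
toy = group_rates(
    y_true=[1, 1, 0, 0, 1, 1, 0, 0],
    y_pred=[1, 0, 0, 0, 1, 1, 1, 0],
    groups=[0, 0, 0, 0, 1, 1, 1, 1],
)
print(toy)

# Gaps between the two groups, using the by-group values reported above
reported = {
    "selection rate": {0: 0.18095712861415753, 1: 0.2966101694915254},
    "false negative rate": {0: 0.4682080924855491, 1: 0.28703703703703703},
}
gaps = {name: max(v.values()) - min(v.values()) for name, v in reported.items()}
print(gaps)
```

With the reported values, sentences mentioning "EUR" are selected as positive more often (a gap of roughly 0.12 in selection rate) and true positives are missed less often (a gap of roughly 0.18 in false negative rate).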

In [41]:
ctx.add_entry("bygroup_metrics", results)

Added bygroup_metrics to project metadata and log updated


In [42]:
generate_experiment_overview_report(ctx.to_dict())

## Further Evaluation of Bias

Evaluating bias in financial sentiment analysis is challenging due to the complex and nuanced nature of financial language. For example, a news statement often includes domain-specific jargon, idioms, and context-dependent expressions that can vary significantly across different sources and regions. Additionally, financial texts may inherently reflect the perspectives and biases of their authors. 

In our analysis, the dataset consists of fairly straightforward statements. However, we face another challenge: the ambiguity in defining protected attributes. Financial documents and news rarely contain explicit demographic information, making it challenging to identify and analyze biases against specific groups.

The lack of standardized benchmarks for measuring bias in this domain complicates the evaluation process, as traditional bias detection methods may not be directly applicable or sufficient. These challenges necessitate the development of specialized tools and methodologies to accurately identify and address bias in financial sentiment analysis, ensuring fair and reliable outcomes.

In this notebook, we will use Captum, a model interpretability library for PyTorch, to compute gradient-based token attributions and understand which words or word combinations drive the model's predictions, including the misclassified cases.

In [43]:
from transformers import BertTokenizer, BertForSequenceClassification, BertConfig

from captum.attr import visualization as viz
from captum.attr import IntegratedGradients, LayerConductance, LayerIntegratedGradients
from captum.attr import configure_interpretable_embedding_layer, remove_interpretable_embedding_layer

import torch

In [44]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def predict(inputs):
    return finbert(inputs)[0]

ref_token_id = tokenizer.pad_token_id # Padding token, used to build the reference (baseline) input
sep_token_id = tokenizer.sep_token_id # Separator token, appended to the end of the text
cls_token_id = tokenizer.cls_token_id # Classification token, prepended to the text

def construct_input_ref_pair(text, ref_token_id, sep_token_id, cls_token_id):

    text_ids = tokenizer.encode(text, add_special_tokens=False)
    # construct input token ids
    input_ids = [cls_token_id] + text_ids + [sep_token_id]
    # construct reference token ids 
    ref_input_ids = [cls_token_id] + [ref_token_id] * len(text_ids) + [sep_token_id]

    return torch.tensor([input_ids], device=device), torch.tensor([ref_input_ids], device=device), len(text_ids)

def construct_input_ref_token_type_pair(input_ids, sep_ind=0):
    seq_len = input_ids.size(1)
    token_type_ids = torch.tensor([[0 if i <= sep_ind else 1 for i in range(seq_len)]], device=device)
    ref_token_type_ids = torch.zeros_like(token_type_ids, device=device)# * -1
    return token_type_ids, ref_token_type_ids

def construct_input_ref_pos_id_pair(input_ids):
    seq_length = input_ids.size(1)
    position_ids = torch.arange(seq_length, dtype=torch.long, device=device)
    # we could potentially also use random permutation with `torch.randperm(seq_length, device=device)`
    ref_position_ids = torch.zeros(seq_length, dtype=torch.long, device=device)

    position_ids = position_ids.unsqueeze(0).expand_as(input_ids)
    ref_position_ids = ref_position_ids.unsqueeze(0).expand_as(input_ids)
    return position_ids, ref_position_ids
    
def construct_attention_mask(input_ids):
    return torch.ones_like(input_ids)

def custom_forward(inputs):
    preds = predict(inputs)
    # Attribution target: probability of class 0 ('Neutral' in config.id2label) for the first example
    return torch.softmax(preds, dim = 1)[0][0].unsqueeze(-1)

In [45]:
lig = LayerIntegratedGradients(custom_forward, finbert.bert.embeddings)

input_ids, ref_input_ids, sep_id = construct_input_ref_pair(sentence, ref_token_id, sep_token_id, cls_token_id)
token_type_ids, ref_token_type_ids = construct_input_ref_token_type_pair(input_ids, sep_id)
position_ids, ref_position_ids = construct_input_ref_pos_id_pair(input_ids)
attention_mask = construct_attention_mask(input_ids)

indices = input_ids[0].detach().tolist()
all_tokens = tokenizer.convert_ids_to_tokens(indices)

In [46]:
score = predict(input_ids)

print('Sentence: ', sentence)
print('Predicted label id: ' + str(torch.argmax(score[0]).numpy()) + ', prob Neutral (class 0): ' + str(torch.softmax(score, dim = 1)[0][0].detach().numpy()))

Sentence:  The company has strong growth prospects.
Predicted label id: 1, prob Neutral (class 0): 9.0680946e-10


In [47]:
def summarize_attributions(attributions):
    attributions = attributions.sum(dim=-1).squeeze(0)
    attributions = attributions / torch.norm(attributions)
    return attributions

In [48]:
attributions, delta = lig.attribute(inputs=input_ids,
                                    baselines=ref_input_ids,
                                    return_convergence_delta=True)

attributions_sum = summarize_attributions(attributions)
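Integrated gradients attributes a prediction to each input by accumulating gradients along a straight path from a baseline to the actual input. The convergence delta returned above measures how far the attributions are from summing to the difference between the model output at the input and at the baseline. A minimal sketch on a toy sigmoid model (not FinBERT, and not Captum's implementation) illustrates both ideas:

```python
import math


def model(x, w=(1.5, -2.0, 0.5)):
    """Toy differentiable model: sigmoid of a weighted sum."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))


def grad(x, w=(1.5, -2.0, 0.5)):
    """Analytic gradient of the toy model with respect to each input."""
    p = model(x, w)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann approximation of integrated gradients."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad(point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


x, baseline = [1.0, 0.5, 2.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

# Completeness: attributions should add up to model(x) - model(baseline)
delta = sum(attributions) - (model(x) - model(baseline))
print(attributions, delta)
```

In the notebook, the all-padding reference sequence plays the role of `baseline`, and a small delta suggests the path integral was approximated well enough for the attributions to be trusted.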

In [49]:
# Store the sample in a visualization record
score_vis = viz.VisualizationDataRecord(
                        attributions_sum,
                        torch.softmax(score, dim = 1)[0][0],
                        torch.argmax(torch.softmax(score, dim = 1)[0]),
                        1, # Positive Sentiment
                        sentence,
                        attributions_sum.sum(),       
                        all_tokens,
                        delta)

print('\033[1m', 'Visualization For Score', '\033[0m')
viz.visualize_text([score_vis])

Visualization For Score

| True Label | Predicted Label | Attribution Label | Attribution Score | Word Importance |
|------------|-----------------|-------------------|-------------------|-----------------|
| 1.0 | 1 (0.00) | The company has strong growth prospects. | -1.39 | [CLS] the company has strong growth prospects . [SEP] |


## Using our prior knowledge create protected attributes

We explored the data and the model's capabilities and drawbacks using a variety of libraries. Based on the results and the existing literature, let's identify hidden protected attributes.

In [50]:
# Get the indices of the misclassified examples with "positive" labels
misclassified_pos_indices = [i for i in range(len(y_true)) 
                         if y_true[i] == 'positive' and y_pred[i] != 'positive']

# Get the indices of the misclassified examples with "negative" labels
misclassified_neg_indices = [i for i in range(len(y_true)) 
                         if y_true[i] == 'negative' and y_pred[i] != 'negative']

# Get the indices of the misclassified examples with "neutral" labels
misclassified_neu_indices = [i for i in range(len(y_true)) 
                         if y_true[i] == 'neutral' and y_pred[i] != 'neutral']

# 3. Mitigating Bias with Data Augmentation

In the analysis, "" and "" emerged as potential protected attributes in the training process. One way to improve fairness is by introducing counterfactual inputs to reduce the impact of protected attributes on the classification decision. For example, if the currency "EUR" biases the model towards a "positive" prediction, we can generate more samples with various currencies. For instance:

Original sentence: "For the last quarter of 2010, Componenta's net sales doubled to EUR131m from EUR76m for the same period a year earlier, while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m."
Sentiment: Positive

If all sentences with the EUR currency are labeled as positive, the model might incorrectly associate the occurrence of EUR with positivity. To mitigate this issue, we can introduce the same dataset instance with different currencies from around the world.
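The idea of currency counterfactuals can be sketched with a small regular-expression substitution. This is an illustrative stand-in, not the `CounterfactualGenerator` used below: the function name `swap_currency`, the short code list, and the shortened example sentence are assumptions. The sketch replaces every occurrence of a given code with the same substitute, so the resulting sentence stays internally consistent.

```python
import random
import re

# Small illustrative vocabulary; the notebook reads ISO codes from a CSV instead
CURRENCY_CODES = ["EUR", "USD", "GBP", "JPY", "SEK", "TRY"]
# Match a code at a word start, followed by a digit (e.g. EUR131m) or a word boundary
CODE_PATTERN = re.compile(r"\b(" + "|".join(CURRENCY_CODES) + r")(?=\d|\b)")


def swap_currency(sentence, rng):
    """Replace each distinct currency code with one randomly chosen other code."""
    mapping = {}

    def replace(match):
        code = match.group(1)
        if code not in mapping:
            mapping[code] = rng.choice([c for c in CURRENCY_CODES if c != code])
        return mapping[code]

    return CODE_PATTERN.sub(replace, sentence)


original = "Net sales doubled to EUR131m from EUR76m, with a pre-tax loss of EUR7m."
counterfactual = swap_currency(original, random.Random(0))
print(counterfactual)
```

Since only the currency code changes, the counterfactual keeps the label of the original sentence, and a model whose prediction flips between the two is relying on the currency rather than on the financial content.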


In [None]:
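# Relative imports do not work in a notebook; this assumes counterfactual_generator.py sits next to it
from counterfactual_generator import CounterfactualGenerator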

In [None]:
sentence = "For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m ."
vocab_path =  "data/codes-all.csv"
df = pd.read_csv(vocab_path)
vocab_code = df["AlphabeticCode"].values
cf_generator_code = CounterfactualGenerator(vocab_code)

In [None]:
example_cf = cf_generator_code.generate_random_counterfactual(sentence)
example_cf

In [None]:
# Now that the example counterfactual is generated, we can use the pipeline to predict its sentiment.
# Note that the counterfactual is almost meaningless: it mixes three different currencies, so it is unclear
# whether the change is an increase or a decrease, yet the overall statement still reads as positive.
print(pipe(sentence))
print(pipe(example_cf))

In [None]:
sentence = "According to Gran , the company has no plans to move all production to Germany, although that is where the company is growing ."

vocab_ent= df["Entity"].values
cf_generator_ent = CounterfactualGenerator(vocab_ent)
example_cf = cf_generator_ent.generate_random_counterfactual(sentence)
example_cf

In [None]:
print(pipe(sentence))
print(pipe(example_cf))

In [None]:
vocab_path =  "data/codes-all.csv"
target = "Entity"

# Save counterfactuals in a new dataframe with the sentiment

sents = []
cfarr = []

# To process the full dataset, use: for i in range(len(raw_data)):
for i in range(1):
    sentiment = raw_data.iloc[i]['sentiment']
    cfs = cf_generator_ent.generate_counterfactuals(raw_data.iloc[i]['text'], vocab_path, target)
    for cf in cfs:
        sents.append(sentiment)
        cfarr.append(cf)

cf_df = pd.DataFrame({'sentiment': sents, 'text': cfarr})

# Save it to file
cf_df.to_csv('./data/output/counterfactual/financialphrasebank_cfs.csv', index=False)