# Bias Detection

In the statistical sense, bias in a machine learning model is the error introduced by overly simplistic assumptions in the learning algorithm, which leads to underfitting: the model fails to capture the underlying patterns in the data. It is one side of the bias-variance trade-off, where high bias produces a model that performs poorly on both training and unseen data, and balancing the two is crucial for a model that generalizes well. This notebook looks at a related but distinct notion of bias: whether favorable outcomes are distributed differently across groups defined by a sensitive attribute such as gender.

In [9]:
!pip install aif360 --quiet

In [10]:
## Import libraries

In [15]:
import torch
import torch.nn as nn  # needed for nn.Module, nn.Embedding, nn.LSTM, nn.Linear
import numpy as np
import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

The `TextGenerationModel` class is a PyTorch module for text generation built around an LSTM network.
It is initialized with four parameters:
- `input_size`: the dimension of the token embeddings fed to the LSTM
- `hidden_size`: the size of the LSTM hidden state
- `output_size`: the number of scores the model produces
- `vocab_size`: the number of tokens in the input vocabulary

The class includes an embedding layer (`nn.Embedding`) that converts token indices into dense vectors, an LSTM layer (`nn.LSTM`) that processes the embedded sequence, and a fully connected layer (`nn.Linear`) that maps the LSTM output to `output_size` scores. In this notebook, `output_size` is the number of candidate continuations in `output_vocab` rather than the size of the word vocabulary.

In the `forward` method, the input tensor `x` is first passed through the embedding layer, then processed by the LSTM layer, and finally, the output of the LSTM's last time step is passed through the fully connected layer to generate the prediction for the next word in the sequence.

In [16]:
class TextGenerationModel(nn.Module):
    """
    A PyTorch module for text generation using LSTM.

    Attributes:
        hidden_size (int): The size of the LSTM hidden state.
        embed (nn.Embedding): Maps token indices to vectors of size input_size.
        lstm (nn.LSTM): Processes the embedded sequence.
        fc (nn.Linear): Maps the last LSTM output to output_size scores.
    """

    def __init__(self, input_size, hidden_size, output_size, vocab_size):
        """
        Initializes the TextGenerationModel.

        Args:
            input_size (int): The embedding dimension fed to the LSTM.
            hidden_size (int): The size of the LSTM hidden state.
            output_size (int): The number of output scores (candidate completions).
            vocab_size (int): The number of tokens in the input vocabulary.
        """
        super(TextGenerationModel, self).__init__()
        self.hidden_size = hidden_size
        self.embed = nn.Embedding(vocab_size, input_size)
        # batch_first=True: inputs are (batch, seq_len, features), which is the
        # layout that out[:, -1, :] in forward() assumes when taking the last step.
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        """
        Performs a forward pass through the model.

        Args:
            x (torch.Tensor): Token indices of shape (batch, seq_len), dtype long.

        Returns:
            torch.Tensor: Scores of shape (batch, output_size), one per candidate output.
        """
        x = self.embed(x)
        out, _ = self.lstm(x)
        out = self.fc(out[:, -1, :])
        return out

The `measure_fairness` function computes a fairness metric with respect to a specified sensitive attribute such as gender or race. It accepts a model, a dataset of (prompt, expected text, sensitive attribute value) tuples, the name of the sensitive attribute, and a vocabulary mapping tokens to their embedding indices.

For each data point, the function converts the prompt into a tensor of token indices using the vocabulary, feeds it to the model, and stores the resulting scores. The label, however, does not depend on those scores: it is 1 when the expected text appears in the global `output_vocab` list and 0 otherwise. The function also collects the sensitive attribute values.
The labels and sensitive attributes are converted into numpy arrays and placed in a pandas DataFrame, which is wrapped in a `BinaryLabelDataset`; the model scores are attached as `dataset.scores`. The function then calculates the disparate impact metric, the ratio of the favorable-outcome rate of the unprivileged group to that of the privileged group, and returns it in a dictionary.
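
The prompt-to-index step described above can be sketched in plain Python. The helper name `encode_prompt` and the explicit check for unknown tokens are assumptions made for illustration; the notebook performs the lookup inline, where an out-of-vocabulary token would surface as a bare `KeyError`.

```python
def encode_prompt(prompt, vocab):
    """Map a whitespace-tokenized prompt to a list of vocabulary indices.

    Hypothetical helper for illustration; it is not part of the notebook.
    """
    # str.split() with no argument also drops leading and trailing spaces.
    tokens = prompt.split()
    missing = [t for t in tokens if t not in vocab]
    if missing:
        raise KeyError(f"tokens not in vocab: {missing}")
    return [vocab[t] for t in tokens]


toy_vocab = {"The": 0, "young": 1, "wizard": 2}
print(encode_prompt("The young wizard ", toy_vocab))  # [0, 1, 2]
```

The resulting list is what the notebook passes to `torch.tensor(..., dtype=torch.long)` before adding a batch dimension.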

In [17]:
def measure_fairness(model, data, sensitive_attr, vocab):
    """
    Measure fairness metrics for a text generation model.

    Args:
        model (nn.Module): The text generation model.
        data (list): A list of tuples (prompt, text, sensitive_attr_value).
        sensitive_attr (str): The sensitive attribute to measure fairness for (e.g., 'gender', 'race').
        vocab (dict): A dictionary mapping tokens to their indices in the embedding layer.

    Returns:
        dict: A dictionary containing fairness metric values.
    """
    outputs = []
    labels = []
    sensitive_attrs = []

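    for prompt, text, attr_value in data:
        # Encode the whitespace-tokenized prompt as a (1, seq_len) batch of indices.
        prompt_tensor = torch.tensor([vocab[token] for token in prompt.split()], dtype=torch.long)
        output = model(prompt_tensor.unsqueeze(0))
        outputs.append(output.squeeze().detach().numpy())
        # NOTE: output_vocab is read from the enclosing (global) scope, and the label
        # only checks whether the expected text is listed there; it ignores `output`.
        labels.append(int(text in output_vocab))
        sensitive_attrs.append(attr_value)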

    outputs = np.array(outputs)
    labels = np.array(labels)
    sensitive_attrs = np.array(sensitive_attrs)

    df = pd.DataFrame({"label": labels, sensitive_attr: sensitive_attrs})

    dataset = BinaryLabelDataset(
        favorable_label=1,
        unfavorable_label=0,
        df=df,
        label_names=['label'],
        protected_attribute_names=[sensitive_attr],
        unprivileged_protected_attributes=[{sensitive_attr: 1}],
    )

    dataset.scores = outputs

    metric = BinaryLabelDatasetMetric(dataset, unprivileged_groups=[{sensitive_attr: 1}], privileged_groups=[{sensitive_attr: 0}])
    disparate_impact = metric.disparate_impact()
    fairness_metrics = {
        "Disparate Impact": disparate_impact
    }

    return fairness_metrics
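
In the function above, the label is `int(text in output_vocab)`, so it does not depend on what the model produces. One possible alternative, shown here as a sketch under stated assumptions rather than as the notebook's method, marks an example as favorable only when the highest-scoring entry of `output_vocab` equals the expected text. The helper name `prediction_label` and the example scores are hypothetical.

```python
def prediction_label(scores, text, output_vocab):
    """Return 1 if the top-scoring completion equals the expected text, else 0.

    Hypothetical helper: `scores` holds one score per entry of `output_vocab`.
    """
    if len(scores) != len(output_vocab):
        raise ValueError("expected one score per candidate completion")
    # Index of the largest score (argmax) using only the standard library.
    best = max(range(len(scores)), key=lambda i: scores[i])
    return int(output_vocab[best] == text)


candidates = [
    "waved his wand and cast a powerful spell.",
    "conjured a magical potion from thin air.",
]
made_up_scores = [0.2, 1.3]  # illustrative values, not real model output
print(prediction_label(made_up_scores, candidates[1], candidates))  # 1
print(prediction_label(made_up_scores, candidates[0], candidates))  # 0
```

With a PyTorch output, the scores could be obtained with `output.squeeze().tolist()`. For a randomly initialized model, labels computed this way would be essentially arbitrary, so the resulting metric would only become meaningful after training.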

In this code, a small dataset (`data`) is defined, consisting of four tuples where each tuple contains a prompt, a continuation text, and a binary sensitive attribute indicating gender (0 for male, 1 for female). 

- A vocabulary dictionary (`vocab`) maps words to their corresponding indices, and an output vocabulary (`output_vocab`) lists the possible generated text completions. 

- A text generation model (`TextGenerationModel`) is instantiated with specified sizes for input, hidden, and output layers, and the embedding layer's weights are initialized uniformly between -1 and 1. 

- The `measure_fairness` function is then called to evaluate the fairness of the model concerning gender. 

This function processes the data to generate model outputs, labels, and sensitive attribute values, calculates the disparate impact metric, and returns it in a dictionary. Finally, the calculated fairness metrics are printed.

In [18]:
data = [
    ("The young wizard ", "waved his wand and cast a powerful spell.", 0),  # Male
    ("The enchantress ", "conjured a magical potion from thin air.", 1),  # Female
    ("The wise sorcerer ", "consulted ancient tomes for forbidden knowledge.", 0),  # Male
    ("The sorceress ", "channeled the elements to bend them to her will.", 1),  # Female
]

vocab = {'The': 0, 'young': 1, 'wizard': 2, 'waved': 3, 'his': 4, 'wand': 5, 'and': 6, 'cast': 7, 'a': 8, 'powerful': 9, 'spell.': 10, 'enchantress': 11, 'conjured': 12, 'magical': 13, 'potion': 14, 'from': 15, 'thin': 16, 'air.': 17, 'wise': 18, 'sorcerer': 19, 'consulted': 20, 'ancient': 21, 'tomes': 22, 'for': 23, 'forbidden': 24, 'knowledge.': 25, 'sorceress': 26, 'channeled': 27, 'the': 28, 'elements': 29, 'to': 30, 'bend': 31, 'them': 32, 'her': 33, 'will.': 34}
output_vocab = ['waved his wand and cast a powerful spell.', 'conjured a magical potion from thin air.', 'consulted ancient tomes for forbidden knowledge.', 'channeled the elements to bend them to her will.']

# Initialize model
model = TextGenerationModel(input_size=10, hidden_size=20, output_size=len(output_vocab), vocab_size=len(vocab))
model.embed.weight.data.uniform_(-1, 1)

# Measure fairness
fairness_metrics = measure_fairness(model, data, "gender", vocab)
print(fairness_metrics)


{'Disparate Impact': 1.0}


In the context of fairness metrics, the Disparate Impact (DI) ratio is a measure used to evaluate whether different groups (typically defined by sensitive attributes such as gender, race, etc.) receive favorable outcomes at different rates. It's calculated as the ratio of the rate of favorable outcomes for the unprivileged group to the rate of favorable outcomes for the privileged group.

A Disparate Impact value of 1 indicates parity between the groups being compared: the rate of favorable outcomes is the same for the unprivileged group and the privileged group. According to this metric alone, neither group is disadvantaged. In this notebook a favorable outcome is a label of 1, which `measure_fairness` assigns when the expected text appears in `output_vocab`.


- **Disparate Impact = 1**: The model treats both groups equally in terms of favorable outcomes.
- **Disparate Impact < 1**: The unprivileged group receives favorable outcomes at a lower rate than the privileged group, indicating potential bias against the unprivileged group.
- **Disparate Impact > 1**: The unprivileged group receives favorable outcomes at a higher rate than the privileged group, which could indicate bias in favor of the unprivileged group.
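
The three cases above can be checked by computing the ratio by hand. The following is a minimal sketch in plain Python, not AIF360's implementation; the function name, its keyword arguments, and the choice to raise an error when the privileged rate is zero are assumptions made for illustration.

```python
def disparate_impact(labels, groups, unprivileged=1, privileged=0, favorable=1):
    """Ratio of favorable-outcome rates: unprivileged group over privileged group."""

    def favorable_rate(group):
        members = [lab for lab, g in zip(labels, groups) if g == group]
        if not members:
            raise ValueError(f"no examples for group {group!r}")
        return sum(lab == favorable for lab in members) / len(members)

    priv_rate = favorable_rate(privileged)
    if priv_rate == 0:
        raise ZeroDivisionError("privileged group has no favorable outcomes")
    return favorable_rate(unprivileged) / priv_rate


groups = [0, 1, 0, 1]  # 0 = privileged, 1 = unprivileged
print(disparate_impact([1, 1, 1, 1], groups))  # 1.0
print(disparate_impact([1, 0, 1, 1], groups))  # 0.5
print(disparate_impact([0, 1, 1, 1], groups))  # 2.0
```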

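Here, a Disparate Impact of 1.0 follows directly from the data: all four expected continuations appear in `output_vocab`, so every example is labeled favorable and both the male and female groups have a favorable-outcome rate of 1.0. Because the labels are not derived from the model's predictions, this value describes the labeled dataset rather than the behavior of the randomly initialized model.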

In [19]:
data = [
    ("The young wizard ", "waved his wand and cast a powerful spell.", 0),  # Male
    ("The enchantress ", "conjured a magical potion from thin air.", 1),  # Female
    ("The wise sorcerer ", "consulted ancient tomes for forbidden knowledge.", 0),  # Male
    ("The sorceress ", "attempted to cast a spell but failed.", 1),  # Female - modified text
]

# Initialize model
model = TextGenerationModel(input_size=10, hidden_size=20, output_size=len(output_vocab), vocab_size=len(vocab))
model.embed.weight.data.uniform_(-1, 1)

# Measure fairness
fairness_metrics = measure_fairness(model, data, "gender", vocab)
print(fairness_metrics)


{'Disparate Impact': 0.5}


A Disparate Impact (DI) value of 0.5 means that the rate of favorable outcomes for the unprivileged group (the female group, sensitive attribute value 1) is half that of the privileged group (the male group, sensitive attribute value 0). The disparity comes from the modified example: its expected text is no longer in `output_vocab`, so `measure_fairness` labels it unfavorable.

Let's break this down with the given data and the modified example:

### Data Breakdown
- **Male group (sensitive attribute value 0)**:
  1. Prompt: "The young wizard ", Expected output: "waved his wand and cast a powerful spell."
  2. Prompt: "The wise sorcerer ", Expected output: "consulted ancient tomes for forbidden knowledge."

- **Female group (sensitive attribute value 1)**:
  1. Prompt: "The enchantress ", Expected output: "conjured a magical potion from thin air."
  2. Prompt: "The sorceress ", Expected output: "attempted to cast a spell but failed." (modified text)
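
The labels behind this breakdown can be reproduced without the model, because `measure_fairness` sets each label to `int(text in output_vocab)`. The sketch below copies the expected texts and `output_vocab` from the cells above; the variable names `records`, `labels`, and `rates` are illustrative.

```python
output_vocab = [
    "waved his wand and cast a powerful spell.",
    "conjured a magical potion from thin air.",
    "consulted ancient tomes for forbidden knowledge.",
    "channeled the elements to bend them to her will.",
]

# (expected text, gender) pairs from the modified dataset; 0 = male, 1 = female
records = [
    ("waved his wand and cast a powerful spell.", 0),
    ("conjured a magical potion from thin air.", 1),
    ("consulted ancient tomes for forbidden knowledge.", 0),
    ("attempted to cast a spell but failed.", 1),
]

# Same labeling rule as measure_fairness: favorable if the text is listed.
labels = [int(text in output_vocab) for text, _ in records]
print(labels)  # [1, 1, 1, 0]

# Favorable-outcome rate per group.
rates = {}
for group in (0, 1):
    group_labels = [lab for lab, (_, g) in zip(labels, records) if g == group]
    rates[group] = sum(group_labels) / len(group_labels)
print(rates)  # {0: 1.0, 1: 0.5}
```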

### Interpretation of Disparate Impact 0.5
In the context of the Disparate Impact metric:
- If one group receives favorable labels more often than the other, the ratio moves away from 1, which indicates a disparity.
- A DI of 0.5 means the female group's favorable-outcome rate is 50% of the male group's rate.

### Example Scenario
- `measure_fairness` labels an example favorable when its expected text appears in `output_vocab`.
- Both male examples meet that condition.
- Only one female example does, because the modified text "attempted to cast a spell but failed." is not in `output_vocab`.

The favorable outcome rate for each group is therefore:
- Male group: 2 favorable out of 2 instances (100%)
- Female group: 1 favorable out of 2 instances (50%)

Disparate Impact calculation:
\[ \text{DI} = \frac{\text{Favorable rate for females}}{\text{Favorable rate for males}} = \frac{0.5}{1.0} = 0.5 \]

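This indicates that the female group receives favorable outcomes at half the rate of the male group.

### Conclusion
A DI value of 0.5 reveals a disparity: favorable outcomes occur for females only half as often as for males. In this notebook the disparity originates in the labels, which are computed from the expected texts and `output_vocab` rather than from the model's predictions, so the result says nothing yet about the randomly initialized model itself. Auditing the model would require labels derived from its outputs; a persistent disparity at that point would suggest that the model needs further training, adjustment, or re-evaluation to ensure fair treatment across both gender groups.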