# Coreference Resolution and Bias Mitigation in NLP

## Project Overview
Coreference resolution is the task of identifying when different words or phrases in a text refer to the same entity. The task is central to modeling text coherence, which supports a range of natural language processing (NLP) applications such as machine translation, summarization, sentiment analysis, and question answering. However, coreference resolution models often exhibit gender bias, associating certain roles or attributes more strongly with one gender than with another. This project explores ...

In this project, I will use a transformer-based model to perform coreference resolution on the **Gendered Ambiguous Pronoun (GAP) dataset**. The project focuses on identifying and addressing gender bias within coreference resolution tasks.

## Dataset Background
The GAP dataset, developed by Google AI Language, is a gender-balanced corpus designed to help address gender bias in coreference resolution. The dataset contains examples where gendered pronouns (e.g., "he", "she") refer ambiguously to potential antecedents within the text. It provides annotated pairs of pronouns and candidate names, which the model needs to link correctly. This dataset is widely used for studying and reducing gender bias in coreference tasks.

The GAP dataset is available on [Kaggle](https://www.kaggle.com/c/gendered-pronoun-resolution) and Google Research’s [GitHub repository](https://github.com/google-research-datasets/gap-coreference). Previous work on this dataset has focused on:

- Fine-tuning models like BERT to achieve accurate coreference resolution.
- Developing metrics to evaluate model fairness across genders.
- Applying debiasing techniques to reduce model bias in gender-specific predictions.

## Project Purpose
In this project, I will first work with English-language coreference resolution to build a foundational understanding of bias in models. This initial approach will help prepare for a more complex challenge in my native language, Spanish. In Spanish, nouns and adjectives are gendered, which introduces additional nuances to coreference resolution. This complexity makes Spanish coreference resolution inherently more challenging, and ensuring fairness becomes even more critical.

### Project Steps
1. **Data Preparation and Exploration**: Load and explore the GAP dataset to understand its structure.
2. **Baseline Model Building and Evaluation**: Train a baseline transformer model for coreference resolution.
3. **Fairness Evaluation**: Quantify bias in the baseline model across genders.
4. **Bias Mitigation Techniques**: Apply debiasing methods to improve fairness in coreference resolution.
5. **Post-Debiasing Evaluation**: Re-evaluate the model to assess improvements in fairness.

Let's begin by loading and exploring the dataset.


## Step 1: Importing Libraries and Loading the GAP Dataset

I will start by importing the necessary libraries and loading the GAP dataset for coreference resolution. The dataset provides examples where pronouns ambiguously refer to two possible entities, allowing us to train a model to correctly resolve these references in a fair and unbiased manner.


In [1]:
# Importing essential libraries
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, classification_report
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the GAP dataset (replace with local path if necessary)
url = "https://raw.githubusercontent.com/google-research-datasets/gap-coreference/master/gap-development.tsv"
data = pd.read_csv(url, delimiter='\t')

# Display first few rows
data.head()




Unnamed: 0,ID,Text,Pronoun,Pronoun-offset,A,A-offset,A-coref,B,B-offset,B-coref,URL
0,development-1,Zoe Telford -- played the police officer girlf...,her,274,Cheryl Cassidy,191,True,Pauline,207,False,http://en.wikipedia.org/wiki/List_of_Teachers_...
1,development-2,"He grew up in Evanston, Illinois the second ol...",His,284,MacKenzie,228,True,Bernard Leach,251,False,http://en.wikipedia.org/wiki/Warren_MacKenzie
2,development-3,"He had been reelected to Congress, but resigne...",his,265,Angeloz,173,False,De la Sota,246,True,http://en.wikipedia.org/wiki/Jos%C3%A9_Manuel_...
3,development-4,The current members of Crime have also perform...,his,321,Hell,174,False,Henry Rosenthal,336,True,http://en.wikipedia.org/wiki/Crime_(band)
4,development-5,Her Santa Fe Opera debut in 2005 was as Nuria ...,She,437,Kitty Oppenheimer,219,False,Rivera,294,True,http://en.wikipedia.org/wiki/Jessica_Rivera


## Dataset Structure

The GAP dataset contains the following columns:
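- **Text**: The passage containing an ambiguous pronoun.
- **Pronoun**: The pronoun in question.
- **Pronoun-offset**, **A-offset**, **B-offset**: Character offsets of the pronoun and of each candidate mention within the text.
- **A** and **B**: The two candidate entities that the pronoun may refer to.
- **A-coref** and **B-coref**: Boolean labels indicating whether the pronoun refers to A or to B. Both are False when it refers to neither candidate.
- **URL**: The Wikipedia page the passage was taken from.

There is no separate gender column; the gender of each example has to be inferred from the pronoun itself (e.g., "he" vs. "she").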

These columns provide the context needed to train a model for pronoun disambiguation and coreference resolution.
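To make the role of the offset columns concrete, here is a minimal sketch of how a character offset locates a mention inside the passage. The row is a toy example invented for illustration (it is not taken from GAP), and `span_at` is a hypothetical helper name.

```python
# Toy row in the GAP column layout; the text and names are invented for illustration.
row = {
    "Text": "Alice met Carol before she left for the airport.",
    "Pronoun": "she", "Pronoun-offset": 23,
    "A": "Alice", "A-offset": 0,
    "B": "Carol", "B-offset": 10,
}


def span_at(text, offset, mention):
    """Return the slice of `text` that starts at `offset` and is as long as `mention`."""
    return text[offset:offset + len(mention)]


# Each offset should point exactly at the start of its mention.
for name_col, offset_col in [("Pronoun", "Pronoun-offset"), ("A", "A-offset"), ("B", "B-offset")]:
    found = span_at(row["Text"], row[offset_col], row[name_col])
    print(name_col, repr(found), found == row[name_col])
```

Running the same check over the real DataFrame is a cheap way to confirm that the offsets are character-based before relying on them for preprocessing.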

I will preprocess the dataset by tokenizing the text, encoding the pronoun and candidate pairs, and splitting the data into training and test sets.


In [10]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   ID              2000 non-null   object
 1   Text            2000 non-null   object
 2   Pronoun         2000 non-null   object
 3   Pronoun-offset  2000 non-null   int64 
 4   A               2000 non-null   object
 5   A-offset        2000 non-null   int64 
 6   A-coref         2000 non-null   bool  
 7   B               2000 non-null   object
 8   B-offset        2000 non-null   int64 
 9   B-coref         2000 non-null   bool  
 10  URL             2000 non-null   object
dtypes: bool(2), int64(3), object(6)
memory usage: 144.7+ KB


In [11]:
data.describe(include='all')

Unnamed: 0,ID,Text,Pronoun,Pronoun-offset,A,A-offset,A-coref,B,B-offset,B-coref,URL
count,2000,2000,2000,2000.0,2000,2000.0,2000,2000,2000.0,2000,2000
unique,2000,1999,9,,1793,,2,1774,,2,1834
top,development-1,"According to her mother, Tatyana Vladimovna, D...",her,,Elizabeth,,False,Mary,,False,http://en.wikipedia.org/wiki/Wilhelmina_Slater
freq,1,2,534,,7,,1126,8,,1075,4
mean,,,,324.9635,,239.778,,,300.5355,,
std,,,,98.788591,,111.15768,,,113.226357,,
min,,,,3.0,,0.0,,,16.0,,
25%,,,,274.0,,179.75,,,237.0,,
50%,,,,316.0,,239.0,,,294.0,,
75%,,,,370.0,,301.25,,,358.0,,


Identify if any rows have missing values, especially in essential columns like Text, Pronoun, A, and B.

In [12]:
data.isnull().sum()

ID                0
Text              0
Pronoun           0
Pronoun-offset    0
A                 0
A-offset          0
A-coref           0
B                 0
B-offset          0
B-coref           0
URL               0
dtype: int64

Check the balance of classes for A-coref and B-coref labels to see if the dataset is balanced between pronouns that refer to A vs. B.

In [13]:
data['A-coref'].value_counts(normalize=True)  # Distribution of labels for `A` (not displayed: a cell shows only its last expression)
data['B-coref'].value_counts(normalize=True)  # Distribution of labels for `B`

B-coref
False    0.5375
True     0.4625
Name: proportion, dtype: float64

Determine the distribution of male vs. female pronouns to see if the dataset is gender-balanced, as this could impact model fairness.

In [16]:
data['Pronoun'].value_counts(normalize=True)  # Proportion of each pronoun

Pronoun
her    0.2670
his    0.2480
she    0.1245
he     0.1175
She    0.0895
He     0.0690
him    0.0490
Her    0.0190
His    0.0165
Name: proportion, dtype: float64
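The per-pronoun proportions do not answer the gender-balance question directly, because each gender is spread over several surface forms and capitalizations. The sketch below groups the proportions printed above by gender. The grouping sets and the helper name `pronoun_gender` are assumptions made for illustration.

```python
# Proportions copied from the value_counts output above.
pronoun_share = {
    "her": 0.2670, "his": 0.2480, "she": 0.1245, "he": 0.1175, "She": 0.0895,
    "He": 0.0690, "him": 0.0490, "Her": 0.0190, "His": 0.0165,
}

# Assumed grouping of surface forms into two gender classes.
FEMININE = {"she", "her", "hers"}
MASCULINE = {"he", "him", "his"}


def pronoun_gender(pronoun):
    """Map a pronoun to 'feminine', 'masculine', or 'other', ignoring case."""
    p = pronoun.lower()
    if p in FEMININE:
        return "feminine"
    if p in MASCULINE:
        return "masculine"
    return "other"


share_by_gender = {}
for pronoun, share in pronoun_share.items():
    gender = pronoun_gender(pronoun)
    share_by_gender[gender] = share_by_gender.get(gender, 0.0) + share

# Round away floating-point noise from the summation.
share_by_gender = {g: round(s, 4) for g, s in share_by_gender.items()}
print(share_by_gender)  # {'feminine': 0.5, 'masculine': 0.5}
```

On the development split the two groups each cover half of the examples, so any accuracy gap between genders cannot be blamed on an unequal number of examples.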

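Calculate the average and range of text lengths to see whether tokenization adjustments (e.g., truncation or a maximum length) are needed for BERT. Word counts are only a rough proxy here: BERT's WordPiece tokenizer usually produces more tokens than there are whitespace-separated words, and the model accepts at most 512 tokens.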

In [17]:
data['Text_Length'] = data['Text'].apply(lambda x: len(x.split()))  # Word count
data['Text_Length'].describe()  # Summary of text length statistics

count    2000.00000
mean       71.20350
std        20.52705
min        16.00000
25%        58.00000
50%        68.00000
75%        82.00000
max       204.00000
Name: Text_Length, dtype: float64

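Check that entities A and B both appear verbatim in the text. This is a sanity check on the annotations rather than a measure of overlap or proximity; how close the candidate mentions sit to each other and to the pronoun, which can give insight into the difficulty of the task, would have to be computed from the offset columns.

In [18]:
# Check whether `A` and `B` occur as substrings of the text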
data['A_in_Text'] = data.apply(lambda x: x['A'] in x['Text'], axis=1)
data['B_in_Text'] = data.apply(lambda x: x['B'] in x['Text'], axis=1)
print("Entity A in Text:", data['A_in_Text'].value_counts(normalize=True))
print("Entity B in Text:", data['B_in_Text'].value_counts(normalize=True))

Entity A in Text: A_in_Text
True    1.0
Name: proportion, dtype: float64
Entity B in Text: B_in_Text
True    1.0
Name: proportion, dtype: float64
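Proximity can be examined with the offset columns. As a rough sketch, the code below applies a nearest-mention heuristic (pick whichever candidate starts closer to the pronoun) to the five rows shown by `data.head()` earlier. The function name `nearest_candidate` is hypothetical, and five rows are far too few to draw conclusions from; the point is only to show the computation.

```python
# (Pronoun-offset, A-offset, A-coref, B-offset, B-coref) for development-1 to development-5,
# copied from the data.head() output.
rows = [
    (274, 191, True, 207, False),
    (284, 228, True, 251, False),
    (265, 173, False, 246, True),
    (321, 174, False, 336, True),
    (437, 219, False, 294, True),
]


def nearest_candidate(pronoun_offset, a_offset, b_offset):
    """Return 'A' or 'B', whichever mention starts closer to the pronoun (ties go to A)."""
    dist_a = abs(pronoun_offset - a_offset)
    dist_b = abs(pronoun_offset - b_offset)
    return "A" if dist_a <= dist_b else "B"


hits = 0
for p_off, a_off, a_coref, b_off, b_coref in rows:
    guess = nearest_candidate(p_off, a_off, b_off)
    gold = "A" if a_coref else "B" if b_coref else "neither"
    hits += guess == gold

print(f"nearest-mention heuristic: {hits} of {len(rows)} correct")  # 3 of 5
```

In all five rows B is the nearer mention, yet A is the correct referent in two of them, which illustrates why distance alone is not enough to resolve these pronouns.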


Find the most frequently occurring entities and pronouns in the dataset, as this could reveal any potential biases or imbalances.

In [21]:
# Check most common entities and pronouns
print("Most common entities A:", data['A'].value_counts().head(10))
print("Most common entities B:", data['B'].value_counts().head(10))
print("Most common pronouns:", data['Pronoun'].value_counts())


Most common entities A: A
Elizabeth    7
Ellen        6
Jones        5
Maria        5
Margaret     5
Helen        5
Anne         5
Thomas       4
Alice        4
James        4
Name: count, dtype: int64
Most common entities B: B
Mary         8
Smith        5
Emily        5
Jackson      5
Alice        5
Margaret     4
King         4
Stephanie    4
Isabel       4
Daisy        4
Name: count, dtype: int64
Most common pronouns: Pronoun
her    534
his    496
she    249
he     235
She    179
He     138
him     98
Her     38
His     33
Name: count, dtype: int64


Understand the context in which ambiguous pronouns appear, which is essential for designing models that can resolve these ambiguities.

In [22]:
# Sample a few rows where A-coref or B-coref is True to see the text context.
# Only the last expression in a cell is displayed, so the output below shows the B sample.
data[data['A-coref'] == 1][['Text', 'Pronoun', 'A']].sample(5)
data[data['B-coref'] == 1][['Text', 'Pronoun', 'B']].sample(5)

Unnamed: 0,Text,Pronoun,B
760,"Ten seasons after they return to the Abbey, we...",her,Martha
761,This acquisition secured communication with He...,his,Pedro
1827,"Irene married her second husband, Harold E. Kn...",her,Ryan
1040,"Rosie planned to move away with Darren, Demi a...",her,Demi
1157,Because of the close friendship between Marcie...,she,Marcie


Check if there’s any correlation between the length of the text and whether the pronoun refers to A or B, which might provide insights into patterns that the model could exploit.

In [25]:
data.groupby('A-coref')['Text_Length'].mean()

A-coref
False    72.080817
True     70.073227
Name: Text_Length, dtype: float64

In [26]:
# Calculate average text length by coreference label

data.groupby('B-coref')['Text_Length'].mean()


B-coref
False    70.835349
True     71.631351
Name: Text_Length, dtype: float64

## Step 2: Data Preprocessing

To train a transformer-based model for coreference resolution, I'll preprocess the text data by tokenizing the input and encoding the labels for candidate pairs.


In [3]:
# Initialize the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')


In [7]:
print(data.iloc[0])

ID                                                    development-1
Text              Zoe Telford -- played the police officer girlf...
Pronoun                                                         her
Pronoun-offset                                                  274
A                                                    Cheryl Cassidy
A-offset                                                        191
A-coref                                                        True
B                                                           Pauline
B-offset                                                        207
B-coref                                                       False
URL               http://en.wikipedia.org/wiki/List_of_Teachers_...
Name: 0, dtype: object


In [4]:
# Example text, pronoun, and two entities
example_text = data.loc[0, 'Text']
example_pronoun = data.loc[0, 'Pronoun']
entity_A = data.loc[0, 'A']
entity_B = data.loc[0, 'B']

In [8]:
# Format the text with each entity as a possible referent
text_with_A = f"{example_text} [SEP] Pronoun: {example_pronoun} [SEP] Entity: {entity_A}"
text_with_B = f"{example_text} [SEP] Pronoun: {example_pronoun} [SEP] Entity: {entity_B}"

In [9]:
# Tokenize the text
tokens_A = tokenizer(text_with_A, padding='max_length', truncation=True, return_tensors="tf")
tokens_B = tokenizer(text_with_B, padding='max_length', truncation=True, return_tensors="tf")

# Display tokenized output
tokens_A['input_ids'], tokens_B['input_ids']



(<tf.Tensor: shape=(1, 512), dtype=int32, numpy=
 array([[  101, 11199, 10093,  3877,  1011,  1011,  2209,  1996,  2610,
          2961,  6513,  1997,  4079,  1010,  8538,  1012, 14019,  2011,
          4079,  1999,  1996,  2345,  2792,  1997,  2186,  1015,  1010,
          2044,  2002,  7771,  2007,  8437,  1010,  1998,  2003,  2025,
          2464,  2153,  1012, 18188,  2726,  2209, 19431, 13737,  1010,
         15595,  1005,  1055,  2767,  1998,  2036,  1037,  2095,  2340,
         11136,  1999,  4079,  1005,  1055,  2465,  1012, 14019,  2014,
          6898,  2206,  4079,  1005,  1055,  6040,  2044,  2002,  2876,
          1005,  1056,  2031,  3348,  2007,  2014,  2021,  2101, 11323,
          2023,  2001,  2349,  2000,  2032,  9105, 26076,  2125,  2014,
          2767, 15595,  1012,   102,  4013,  3630,  4609,  1024,  2014,
           102,  9178,  1024, 19431, 13737,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,

In [None]:
# Initialize the BERT tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the examples in the GAP dataset
def tokenize_examples(texts, pronouns, entities):
    tokenized_inputs = []
    for text, pronoun, entity in zip(texts, pronouns, entities):
        # Combine text, pronoun, and entity for coreference context
        inputs = f"{text} [SEP] Pronoun: {pronoun} [SEP] Entity: {entity}"
        tokenized_inputs.append(inputs)
    return tokenizer(tokenized_inputs, padding=True, truncation=True, return_tensors="tf")

# Tokenize pronoun and candidate entity pairs
tokenized_data_A = tokenize_examples(data['Text'], data['Pronoun'], data['A'])
tokenized_data_B = tokenize_examples(data['Text'], data['Pronoun'], data['B'])

# Encode labels
data['A-label'] = data['A-coref'].astype(int)
data['B-label'] = data['B-coref'].astype(int)

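# Split into training and test sets (same random_state, so A and B share one split).
# scikit-learn indexes with integer arrays, which tf.Tensor does not support, so convert to NumPy first.
# Note: only input_ids are kept here; the attention masks are not passed to the model.
X_train_A, X_test_A, y_train_A, y_test_A = train_test_split(tokenized_data_A['input_ids'].numpy(), data['A-label'], test_size=0.2, random_state=42)
X_train_B, X_test_B, y_train_B, y_test_B = train_test_split(tokenized_data_B['input_ids'].numpy(), data['B-label'], test_size=0.2, random_state=42)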
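One possible refinement of the split, sketched below under stated assumptions: split row positions once and apply them to every array, so that input IDs, attention masks, and labels stay aligned and the test positions can be reused later to look up each test row's pronoun. The arrays here are small random stand-ins, not real tokenizer output, and the 80/20 cut is done with NumPy rather than `train_test_split` to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, seq_len = 10, 6

# Stand-ins for tokenizer output (hypothetical values, not real token IDs).
input_ids = rng.integers(1, 1000, size=(n_rows, seq_len))
attention_mask = np.ones((n_rows, seq_len), dtype=int)
labels = rng.integers(0, 2, size=n_rows)

# Shuffle row positions once, then cut 80/20.
order = rng.permutation(n_rows)
n_test = int(round(0.2 * n_rows))
test_idx, train_idx = order[:n_test], order[n_test:]

# Apply the same positions to every array so the rows stay aligned.
train = {"input_ids": input_ids[train_idx], "attention_mask": attention_mask[train_idx]}
test = {"input_ids": input_ids[test_idx], "attention_mask": attention_mask[test_idx]}
y_train, y_test = labels[train_idx], labels[test_idx]

print(train["input_ids"].shape, test["input_ids"].shape)  # (8, 6) (2, 6)
```

Keeping `test_idx` around makes per-gender evaluation simpler, because the pronoun of each test example can be recovered by position.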


## Step 3: Model Building and Training

Now, I'll build a transformer-based model using BERT to predict which entity a pronoun refers to. Two separate binary classifiers are trained on labels of 0 or 1: `model_A` predicts whether the pronoun refers to `A`, and `model_B` predicts whether it refers to `B`.


In [None]:
# Load a pre-trained BERT model for sequence classification
model_A = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model_B = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

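# Define the loss and compile each model with its own optimizer instance
# (an optimizer tracks per-variable state, so one instance should not be shared between models)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model_A.compile(optimizer=Adam(learning_rate=3e-5), loss=loss_fn, metrics=['accuracy'])
model_B.compile(optimizer=Adam(learning_rate=3e-5), loss=loss_fn, metrics=['accuracy'])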

# Train the models on the tokenized data
history_A = model_A.fit(
    X_train_A, y_train_A,
    validation_data=(X_test_A, y_test_A),
    epochs=3,  # Keep epochs small initially to observe baseline performance
    batch_size=16
)

history_B = model_B.fit(
    X_train_B, y_train_B,
    validation_data=(X_test_B, y_test_B),
    epochs=3,
    batch_size=16
)


## Step 3.1: Baseline Model Evaluation

To understand the performance of our initial coreference resolution model, I will evaluate it on the test set. This will provide baseline metrics such as accuracy and F1 score.


In [None]:
# Predictions on test set for both models
predictions_A = model_A.predict(X_test_A).logits
predictions_B = model_B.predict(X_test_B).logits

# Get predicted labels
predicted_labels_A = np.argmax(predictions_A, axis=1)
predicted_labels_B = np.argmax(predictions_B, axis=1)

# Calculate accuracy and F1 score for each model
accuracy_A = accuracy_score(y_test_A, predicted_labels_A)
f1_A = f1_score(y_test_A, predicted_labels_A)

accuracy_B = accuracy_score(y_test_B, predicted_labels_B)
f1_B = f1_score(y_test_B, predicted_labels_B)

print(f"Model A - Test Accuracy: {accuracy_A:.2f}, F1 Score: {f1_A:.2f}")
print(f"Model B - Test Accuracy: {accuracy_B:.2f}, F1 Score: {f1_B:.2f}")
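For reference, here is a small sketch of what accuracy and F1 measure, computed by hand on made-up labels. The helper `binary_metrics` is hypothetical and exists only to show the arithmetic; the labels are invented and are not model output.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(binary_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75, 'f1': 0.75}
```

F1 is the harmonic mean of precision and recall for the positive class, so it ignores true negatives, whereas accuracy counts them.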


## Step 3.2: Gender Bias Analysis

As an initial step in evaluating fairness, I'll separate examples by gender and calculate accuracy and F1 scores for each gender. This will allow me to observe if the model shows any gender bias in its predictions.


In [None]:
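# Separate test examples by pronoun gender.
# y_test_A keeps the original row labels, so they can be used to look up each test row's pronoun.
test_pronouns = data.loc[y_test_A.index, 'Pronoun'].str.lower()
male_mask = test_pronouns.isin(['he', 'him', 'his']).to_numpy()
female_mask = test_pronouns.isin(['she', 'her', 'hers']).to_numpy()

# Model A performance by gender
y_true_A = y_test_A.to_numpy()
accuracy_A_male = accuracy_score(y_true_A[male_mask], predicted_labels_A[male_mask])
accuracy_A_female = accuracy_score(y_true_A[female_mask], predicted_labels_A[female_mask])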

print(f"Model A - Male Pronouns Accuracy: {accuracy_A_male:.2f}")
print(f"Model A - Female Pronouns Accuracy: {accuracy_A_female:.2f}")

# Similarly, evaluate for Model B if necessary.
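One way to condense the per-gender numbers into a single figure is a feminine-to-masculine ratio, where 1.0 means parity. The GAP authors report such a ratio on F1 scores; the sketch below applies the same idea to accuracy for simplicity. The labels, predictions, and the helper `group_accuracy` are invented for illustration and do not come from the models above.

```python
def group_accuracy(y_true, y_pred, groups):
    """Return accuracy per group label as a dict."""
    totals, correct = {}, {}
    for t, p, g in zip(y_true, y_pred, groups):
        totals[g] = totals.get(g, 0) + 1
        correct[g] = correct.get(g, 0) + int(t == p)
    return {g: correct[g] / totals[g] for g in totals}


# Invented test examples for illustration only.
pronouns = ["he", "She", "his", "her", "him", "she", "He", "Her"]
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# Assumed grouping: lowercase the pronoun and treat she/her/hers as feminine.
groups = ["feminine" if p.lower() in {"she", "her", "hers"} else "masculine" for p in pronouns]

acc = group_accuracy(y_true, y_pred, groups)
ratio = acc["feminine"] / acc["masculine"]
print(acc)    # {'masculine': 1.0, 'feminine': 0.5}
print(ratio)  # 0.5
```

A ratio well below 1.0, as in this toy case, would indicate that the model performs worse on feminine pronouns; the same function can be applied to Model B's predictions.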
