# liaisons-preprocess - IBM's Claim Stance Dataset

This notebook outlines the preprocessing steps undertaken to generate datasets used for testing the [liaisons](https://github.com/coding-kelps/liaisons) client and [benchmarking open-source models in predicting argument relations](https://github.com/coding-kelps/liaisons-experiments).

## About the Dataset

This preprocessing uses IBM's "Claim Stance Dataset" as its source: a set of claims about controversial topics, collected from Wikipedia. Each claim is labeled with its stance towards the topic, either "PRO" (supporting the topic) or "CON" (opposing the topic), which makes it well suited to relation-based argument mining tasks.

As of now, the dataset is available on [HuggingFace](https://huggingface.co/datasets/ibm/claim_stance). Further details on the original dataset can be found in the paper *Stance Classification of Context-Dependent Claims* (Bar-Haim et al., 2017).

## About the Preprocessing

The primary aim of this preprocessing is to create representative samples of the dataset, of roughly 100 entries each, so that benchmarking remains feasible with limited computing resources. In addition, previous models in the field of relation-based argument mining have been shown to give misleading benchmarks by achieving satisfactory results only in specific domains (Gorur et al., 2024). To mitigate this, the preprocessing also rebalances the distribution of claims so that stances and topics are evenly represented.
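
The sizing rule behind the "roughly 100 entries" target can be sketched with plain arithmetic. This is a minimal illustration, assuming the same number of claims is drawn from every (stance, topic) combination; the function name `rows_per_group` and the counts below are hypothetical and not taken from the dataset.

```python
def rows_per_group(counts, target=100):
    """Return how many rows to draw from each (stance, topic) group.

    counts: mapping of (stance, topic) -> number of available claims.
    target: approximate total sample size.
    """
    n_groups = len(counts)
    # Aim for at least one row per group, even if that overshoots the target
    total = max(target, n_groups)
    # Never ask for more rows than the smallest group can provide
    return min(total // n_groups, min(counts.values()))


# Hypothetical counts: 2 stances x 3 topics
toy_counts = {
    ("PRO", "topic A"): 40, ("CON", "topic A"): 35,
    ("PRO", "topic B"): 12, ("CON", "topic B"): 20,
    ("PRO", "topic C"): 60, ("CON", "topic C"): 55,
}
quota = rows_per_group(toy_counts)
print(quota, quota * len(toy_counts))
```

With these toy counts the smallest group caps the quota, so the resulting sample is smaller than the target. The same effect explains why the real samples only approximate 100 entries.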

In [None]:
from datasets import load_dataset
import pandas as pd

# Load dataset from HuggingFace (https://huggingface.co/datasets/ibm/claim_stance)
ds = load_dataset("ibm/claim_stance", "claim_stance")

# Combine "train" and "test" datasets as no training
# will be conducted over this source
df = pd.concat([ds["train"].to_pandas(), ds["test"].to_pandas()]).reset_index(drop=True)

## Dataset Properties

The dataset features 2,394 labeled claims across 55 controversial topics. The stances are roughly balanced (~55% PRO, ~45% CON), but the topics are not: for example, the "unleash the free market" topic alone accounts for ~8% of the claims. Claim stances are also binary: a claim can only support or attack its related topic.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

"""
    Plot the dataset distribution of stances over claim
"""

# Calculate value distribution proportions of the "claims.stance" column
proportions = df["claims.stance"].value_counts(normalize=True)

# Plotting the distribution
sns.barplot(x=proportions.index, y=proportions.values)
plt.title("Proportional distribution of stances over claim")
plt.xlabel("Stance")
plt.ylabel("Frequency")
plt.show()

In [None]:
"""
    Plot the dataset distribution of topics over claim
"""

# Calculate value distribution proportions of the "topicTarget" column
proportions = df["topicTarget"].value_counts(normalize=True)

plt.figure(figsize=(15, 8))
ax = sns.barplot(x=proportions.index, y=proportions.values)
plt.title("Proportional distribution of topics over claim")
plt.xlabel("Topic")
plt.ylabel("Frequency")

# Fix the tick positions first; otherwise the labels may end up in unexpected positions
# https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_xticklabels.html
ax.set_xticks(ax.get_xticks())
# Rotate labels and align to the right
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

# Adjust layout to ensure everything fits without overlap
plt.tight_layout()
plt.show()

## Preprocessing notes

The preprocessing focuses on building two samples:
- a **binary** sample, which keeps the original stances (or relations) a claim can have toward a topic (either PRO or CON).
- a **ternary** sample, which introduces a new *unrelated* value through rule-based data augmentation. New entries are generated by taking claims and associating them with topics deemed unrelated.
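
The rule-based augmentation for the ternary sample can be illustrated on toy data. This is a simplified sketch, not the notebook's exact procedure: it re-attaches every claim of one topic to the other topic of an unrelated pair, without the random sampling step. The helper name `make_unrelated` and the claim texts are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the real dataset has more columns and rows
toy = pd.DataFrame({
    "topicTarget": ["atheism", "atheism", "wind power", "wind power"],
    "claims.claimCorrectedText": [
        "claim about atheism 1", "claim about atheism 2",
        "claim about wind power 1", "claim about wind power 2",
    ],
    "claims.stance": ["PRO", "CON", "PRO", "CON"],
})


def make_unrelated(df, topic_a, topic_b):
    """Pair the claims of each topic with the other topic, labeled UNRELATED."""
    rows_a = df[df["topicTarget"] == topic_a].copy()
    rows_b = df[df["topicTarget"] == topic_b].copy()
    # Claims keep their text but are re-attached to the other topic
    rows_a["topicTarget"] = topic_b
    rows_b["topicTarget"] = topic_a
    out = pd.concat([rows_a, rows_b], ignore_index=True)
    # The original PRO/CON label no longer applies to the new pairing
    out["claims.stance"] = "UNRELATED"
    return out


unrelated = make_unrelated(toy, "atheism", "wind power")
print(unrelated)
```

Each generated row keeps its claim text but points at a topic it was never written about, which is what justifies the new label.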

In [None]:
"""
    Dataset preprocessing for binary sample (with PRO/CON stances) 
"""


# Compute the counts of each combination of 'claims.stance' and 'topicTarget'
combination_counts = df.groupby(['claims.stance', 'topicTarget']).size()

# Calculate the number of unique combinations
num_combinations = len(combination_counts)

# Make the sample about a hundred elements
# Take the number of combinations if it is greater than 100 to avoid empty dataset
sample_nb_entries = max(100, num_combinations)

# Compute the number of entries per combination to get roughly 100 entries in total
# Ensure that it's not greater than the minimum count among the combinations
entries_per_combination = min(int(sample_nb_entries / num_combinations), combination_counts.min())

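# Sample without replacement so that no claim is duplicated within a combination;
# this is safe because entries_per_combination never exceeds the smallest group size
binary_sampled_df = df.groupby(['claims.stance', 'topicTarget'], group_keys=False).apply(
    lambda x: x.sample(entries_per_combination, replace=False)
).reset_index(drop=True)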

In [None]:
"""
    Generate "UNRELATED" claims to topic through rule-based data augmentation
"""

# These pairs of unrelated topics have been selected arbitrarily
# and are open to debate.
unrelated_topics = [
    ("have children", "boxing"),
    ("governments should choose open source software", "re-engage with Myanmar"),
    ("the sale of violent video games to minors", "build hydroelectric dams"),
    ("the use of performance enhancing drugs in professional sports", "Holocaust denial"),
    ("the one-child policy of the republic of China", "intellectual property rights"),
    ("atheism", "wind power"),
    ("the monarchy", "raising the school leaving age to 18"),
    ("endangered species", "freedom of speech"),
    ("advertising", "the blockade of Gaza"),
    ("only teach abstinence for sex education in schools", "the right to bear arms"),
    ("gambling", "all nations have a right to nuclear weapons"),
    ("leaking of military documents", "austerity measures")
]

# Number of claims sampled from each topic of a pair to generate "UNRELATED" entries
nb_generated = 3

unrelated_claims_df = pd.DataFrame()

for topic_a, topic_b in unrelated_topics:
    matched_rows_a = df[df["topicTarget"] == topic_a]
    matched_rows_b = df[df["topicTarget"] == topic_b]

    if matched_rows_a.shape[0] >= nb_generated and matched_rows_b.shape[0] >= nb_generated:
        topic_a_sample = matched_rows_a.sample(n=nb_generated).reset_index(drop=True)
        topic_b_sample = matched_rows_b.sample(n=nb_generated).reset_index(drop=True)

        topic_a_sample[["topicTarget", "topicText"]], topic_b_sample[["topicTarget", "topicText"]] = \
            topic_b_sample[["topicTarget", "topicText"]], topic_a_sample[["topicTarget", "topicText"]]

        unrelated_claims_df = pd.concat([unrelated_claims_df, topic_a_sample, topic_b_sample], ignore_index=True)        

# Set claims stance for all generated entries to "UNRELATED"
unrelated_claims_df["claims.stance"] = "UNRELATED"

# Concat back the generated entries to the original dataset
data_augmented_df = pd.concat([df, unrelated_claims_df], ignore_index=True)

# Retrieve all the topics which have been augmented with the "UNRELATED" claim stance
valid_topic_targets = data_augmented_df[data_augmented_df['claims.stance'] == 'UNRELATED']['topicTarget'].unique()

# Remove all unaugmented topics to ensure that the dataset is balanced
data_augmented_df = data_augmented_df[data_augmented_df['topicTarget'].isin(valid_topic_targets)]

In [None]:
"""
    Dataset preprocessing for ternary sample (with PRO/CON/UNRELATED stances) 
"""

# Compute the counts of each combination of 'claims.stance' and 'topicTarget'
combination_counts = data_augmented_df.groupby(['claims.stance', 'topicTarget']).size()

# Calculate the number of unique combinations
num_combinations = len(combination_counts)

# Make the sample about a hundred elements
# Take the number of combinations if it is greater than 100 to avoid empty dataset
sample_nb_entries = max(100, num_combinations)

# Compute the number of entries per combination to get roughly 100 entries in total
# Ensure that it's not greater than the minimum count among the combinations
entries_per_combination = min(int(sample_nb_entries / num_combinations), combination_counts.min())

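# Sample without replacement so that no claim is duplicated within a combination;
# this is safe because entries_per_combination never exceeds the smallest group size
ternary_sampled_df = data_augmented_df.groupby(['claims.stance', 'topicTarget'], group_keys=False).apply(
    lambda x: x.sample(entries_per_combination, replace=False)
).reset_index(drop=True)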

### Preprocessing Results Notes

The distribution of 'claims.stance' and 'topicTarget' has been rebalanced to achieve equal frequency across categories. However, it is important to note that the data augmentation used to generate unrelated claims has resulted in a reduction in the diversity of topics within the ternary sample.
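
Both statements can also be checked numerically rather than only read off the plots. The helper below is a hedged sketch: `balance_report` is a hypothetical name, and the toy frame only mimics the two columns used by the notebook.

```python
import pandas as pd


def balance_report(df, stance_col="claims.stance", topic_col="topicTarget"):
    """Summarize how evenly rows are spread over (stance, topic) groups."""
    sizes = df.groupby([stance_col, topic_col]).size()
    n_stances = df[stance_col].nunique()
    n_topics = df[topic_col].nunique()
    return {
        "n_topics": n_topics,
        "n_groups": len(sizes),
        # Every observed group holds the same number of rows
        "balanced": sizes.nunique() == 1,
        # Every topic appears with every stance
        "complete": len(sizes) == n_stances * n_topics,
    }


# Hypothetical sample: 2 topics x 2 stances, one row each
toy_sample = pd.DataFrame({
    "claims.stance": ["PRO", "CON", "PRO", "CON"],
    "topicTarget": ["gambling", "gambling", "boxing", "boxing"],
})
report = balance_report(toy_sample)
print(report)
```

Applied to `binary_sampled_df` and `ternary_sampled_df`, comparing the two `n_topics` values would quantify the loss of topic diversity mentioned above.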

In [None]:
"""
    Plot the binary and ternary samples distribution of stances over claim
"""

# Calculate value distribution proportions of the "claims.stance" column in the binary sample
bin_proportions = binary_sampled_df["claims.stance"].value_counts(normalize=True)

# Calculate value distribution proportions of the "claims.stance" column in the ternary sample
ter_proportions = ternary_sampled_df["claims.stance"].value_counts(normalize=True)

fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharey=True)
fig.suptitle('Proportional distribution of stances over claim')

# Plotting the distribution
sns.barplot(ax=axes[0], x=bin_proportions.index, y=bin_proportions.values)
axes[0].set_title("Binary Sample")
axes[0].set_xlabel("Stance")
axes[0].set_ylabel("Frequency")

# Plotting the distribution
sns.barplot(ax=axes[1], x=ter_proportions.index, y=ter_proportions.values)
axes[1].set_title("Ternary Sample")
axes[1].set_xlabel("Stance")
axes[1].set_ylabel("Frequency")

plt.show()

In [None]:
"""
    Plot the binary sample distribution of topics over claim
"""

# Calculate value distribution proportions of the "topicTarget" column
proportions = binary_sampled_df["topicTarget"].value_counts(normalize=True)

plt.figure(figsize=(15, 8))
ax = sns.barplot(x=proportions.index, y=proportions.values)
plt.title("Proportional distribution of topics over claim in binary sample")
plt.xlabel("Topic")
plt.ylabel("Frequency")

# Fix the tick positions first; otherwise the labels may end up in unexpected positions
# https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_xticklabels.html
ax.set_xticks(ax.get_xticks())
# Rotate labels and align to the right
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

# Adjust layout to ensure everything fits without overlap
plt.tight_layout()
plt.show()

In [None]:
"""
    Plot the ternary sample distribution of topics over claim
"""

# Calculate value distribution proportions of the "topicTarget" column
proportions = ternary_sampled_df["topicTarget"].value_counts(normalize=True)

plt.figure(figsize=(15, 8))
ax = sns.barplot(x=proportions.index, y=proportions.values)
plt.title("Proportional distribution of topics over claim in ternary sample")
plt.xlabel("Topic")
plt.ylabel("Frequency")

# Fix the tick positions first; otherwise the labels may end up in unexpected positions
# https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_xticklabels.html
ax.set_xticks(ax.get_xticks())
# Rotate labels and align to the right
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

# Adjust layout to ensure everything fits without overlap
plt.tight_layout()
plt.show()

### Clean Samples

Finally, before the samples can be used in further experiments, both need to be cleaned: columns that are no longer needed are dropped, and the remaining ones are renamed to fit the template of the [experiment framework](https://github.com/coding-kelps/liaisons-experiments).
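
A small validation helper can catch mistakes before the samples are published. The sketch below assumes the template consists of exactly the three columns produced by the renaming in this notebook (`parent_argument`, `child_argument`, `relation`); the helper name `check_template` and the example row are hypothetical.

```python
import pandas as pd

# Assumed template, inferred from the column renaming used in this notebook
EXPECTED_COLUMNS = ["parent_argument", "child_argument", "relation"]


def check_template(df, allowed_relations):
    """Raise ValueError if the frame does not match the assumed template."""
    if list(df.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {list(df.columns)}")
    unknown = set(df["relation"]) - set(allowed_relations)
    if unknown:
        raise ValueError(f"unexpected relation values: {sorted(unknown)}")


# Hypothetical cleaned row
toy_clean = pd.DataFrame({
    "parent_argument": ["This house believes in X"],
    "child_argument": ["X improves Y"],
    "relation": ["support"],
})
check_template(toy_clean, {"support", "attack"})
```

The binary sample would be checked against `{"support", "attack"}` and the ternary one against the same set extended with `"unrelated"`.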

In [None]:
"""
    Clean the binary sample to follow the expected template for benchmarking
"""

# Extract the columns essential to the argument relation prediction task
binary_clean_df = binary_sampled_df.loc[:, ["topicText", "claims.claimCorrectedText", "claims.stance"]]

# Rename columns to fit the liaisons-experiments data template
binary_clean_df = binary_clean_df.rename(columns={"topicText": "parent_argument", "claims.claimCorrectedText": "child_argument", "claims.stance": "relation"})

# Rename relation value categories
binary_clean_df['relation'] = binary_clean_df['relation'].replace({"PRO": "support", "CON": "attack"})

binary_clean_df = binary_clean_df.reset_index(drop=True)

In [None]:
"""
    Clean the ternary sample to follow the expected template for benchmarking
"""

# Extract the columns essential to the argument relation prediction task
ternary_clean_df = ternary_sampled_df.loc[:, ["topicText", "claims.claimCorrectedText", "claims.stance"]]

# Rename columns to fit the liaisons-experiments data template
ternary_clean_df = ternary_clean_df.rename(columns={"topicText": "parent_argument", "claims.claimCorrectedText": "child_argument", "claims.stance": "relation"})

# Rename relation value categories
ternary_clean_df['relation'] = ternary_clean_df['relation'].replace({"PRO": "support", "CON": "attack", "UNRELATED": "unrelated"})

ternary_clean_df = ternary_clean_df.reset_index(drop=True)

In [None]:
"""
    Push datasets to HuggingFace
"""

from datasets import Dataset, DatasetDict
import os

# Load pd.DataFrame as Dataset
binary_data = Dataset.from_pandas(binary_clean_df)
ternary_data = Dataset.from_pandas(ternary_clean_df)

# Bundle all generated datasets into a single DatasetDict
dataset = DatasetDict({
    'binary': binary_data,
    'ternary': ternary_data
})

hf_token = os.environ.get("LIAISONS_HUGGING_FACE_TOKEN")

# Then push to HuggingFace Repository
dataset.push_to_hub('coding-kelps/liaisons-ibm-debater-claim-stance', token=hf_token)

## Bibliography
- Bar-Haim, R., Bhattacharya, I., Dinuzzo, F., Saha, A., and Slonim, N. (2017). Stance Classification of Context-Dependent Claims. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 251–261.
- Gorur, D., Rago, A., and Toni, F. (2024). Can Large Language Models perform Relation-based Argument Mining? arXiv preprint arXiv:2402.11243. https://doi.org/10.48550/arXiv.2402.11243