# liaisons-preprocess

This notebook outlines the preprocessing steps undertaken to generate datasets used for testing the [liaisons](https://github.com/coding-kelps/liaisons) client and [benchmarking open-source models in predicting argument relations](https://github.com/coding-kelps/liaisons-experiments).

## About the Dataset

The source for this preprocessing task is IBM's "Claim Stance Dataset". It comprises 2,394 labeled claims across 55 controversial topics, collected from Wikipedia. Each claim is labeled with its stance towards the topic, either "PRO" (supporting the topic) or "CON" (opposing the topic), which makes it well suited to relation-based argument mining tasks.

As of now, the dataset is available on [HuggingFace](https://huggingface.co/datasets/ibm/claim_stance). Further details on the original dataset can be found in the paper *Stance Classification of Context-Dependent Claims* (Bar-Haim et al., 2017).

## About the Preprocessing

The primary aim of this preprocessing is to draw a small sample of the dataset (roughly 100 claims), balanced across stances and topic targets, in order to reduce the computational cost of benchmarking and of testing related software.

In [None]:
from datasets import load_dataset

# Load dataset from HuggingFace (https://huggingface.co/datasets/ibm/claim_stance)
ds = load_dataset("ibm/claim_stance", "claim_stance")

# Extract train subset as pandas dataframe
df = ds["train"].to_pandas()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Calculate value distribution proportions of the "claims.stance" column
proportions = df['claims.stance'].value_counts(normalize=True)

# Plotting the distribution
sns.barplot(x=proportions.index, y=proportions.values)
plt.title("Proportional distribution of stances over claims")
plt.xlabel('Stance')
plt.ylabel('Proportion')
plt.show()

In [None]:
# Calculate value distribution proportions of the "topicTarget" column
proportions = df['topicTarget'].value_counts(normalize=True)

plt.figure(figsize=(12, 8))
ax = sns.barplot(x=proportions.index, y=proportions.values)
plt.title('Proportional distribution of topics over claims')
plt.xlabel('Topic')
plt.ylabel('Proportion')

# Pin the tick positions before relabeling; set_xticklabels expects fixed ticks
# https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_xticklabels.html
ax.set_xticks(ax.get_xticks())
# Rotate labels and align to the right
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', rotation_mode="anchor")

# Adjust layout to ensure everything fits without overlap
plt.tight_layout()
plt.show()

In [None]:
# Compute the counts of each combination of 'claims.stance' and 'topicTarget'
combination_counts = df.groupby(['claims.stance', 'topicTarget']).size()

# Calculate the number of unique combinations
num_combinations = len(combination_counts)

# Compute the number of entries per combination to get roughly 100 entries in total,
# capped at the size of the smallest combination so every group can supply that many
entries_per_combination = min(100 // num_combinations, int(combination_counts.min()))

# Draw the same number of rows from each combination, without replacement so that
# no claim is duplicated; the fixed seed (an arbitrary choice) makes the sample reproducible
sampled_df = df.groupby(['claims.stance', 'topicTarget']).sample(
    n=entries_per_combination, random_state=42
).reset_index(drop=True)
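
The balanced sampling step can be checked on a toy frame. The sketch below is only an illustration: the helper name `balanced_sample`, its parameters, and the toy data are hypothetical and not part of the notebook, and it relies on pandas' `DataFrameGroupBy.sample` to draw rows without replacement.

```python
import pandas as pd

def balanced_sample(df, keys, target_total, seed=0):
    """Draw the same number of rows from every group defined by `keys`."""
    group_sizes = df.groupby(keys).size()
    # Rows per group: aim for target_total overall, capped by the smallest group
    per_group = min(target_total // len(group_sizes), int(group_sizes.min()))
    return (
        df.groupby(keys)
        .sample(n=per_group, random_state=seed)  # without replacement by default
        .reset_index(drop=True)
    )

# Toy data (hypothetical, not taken from the IBM dataset)
toy = pd.DataFrame({
    "stance": ["PRO"] * 6 + ["CON"] * 4,
    "topic": ["a", "a", "a", "b", "b", "b", "a", "a", "b", "b"],
    "text": [f"claim {i}" for i in range(10)],
})

sample = balanced_sample(toy, ["stance", "topic"], target_total=8)
print(sample.groupby(["stance", "topic"]).size())
```

Every (stance, topic) pair contributes the same number of rows, so the sample's stance and topic distributions are uniform by construction rather than mirroring the proportions of the full dataset.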

In [None]:
# Calculate value distribution proportions of the "claims.stance" column
proportions = sampled_df['claims.stance'].value_counts(normalize=True)

# Plotting the distribution
sns.barplot(x=proportions.index, y=proportions.values)
plt.title("Proportional distribution of stances over sampled claims")
plt.xlabel('Stance')
plt.ylabel('Proportion')
plt.show()

In [None]:
# Calculate value distribution proportions of the "topicTarget" column
proportions = sampled_df['topicTarget'].value_counts(normalize=True)

plt.figure(figsize=(12, 8))
ax = sns.barplot(x=proportions.index, y=proportions.values)
plt.title('Proportional distribution of topics over sampled claims')
plt.xlabel('Topic')
plt.ylabel('Proportion')

# Pin the tick positions before relabeling; set_xticklabels expects fixed ticks
# https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_xticklabels.html
ax.set_xticks(ax.get_xticks())
# Rotate labels and align to the right
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', rotation_mode="anchor")

# Adjust layout to ensure everything fits without overlap
plt.tight_layout()
plt.show()

In [None]:
# Extract the columns essential to the argument relation prediction task
clean_df = sampled_df.loc[:, ["topicText", "claims.claimCorrectedText", "claims.stance"]]

# Rename the columns to match the liaisons-experiments data template
clean_df = clean_df.rename(columns={"topicText": "argument_a", "claims.claimCorrectedText": "argument_b", "claims.stance": "relation"})

# Rename relation value categories
clean_df['relation'] = clean_df['relation'].replace({"PRO": "support", "CON": "attack"})
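
As a sanity check on the column and label mapping, the same transformation can be sketched on a toy frame. The helper `to_relation_frame`, the two constants, and the toy rows are hypothetical. Note that this sketch uses `Series.map` instead of `replace`, so that an unexpected stance label becomes `NaN` and can be rejected rather than passed through silently.

```python
import pandas as pd

# Column and label mappings taken from the cell above
COLUMN_MAP = {
    "topicText": "argument_a",
    "claims.claimCorrectedText": "argument_b",
    "claims.stance": "relation",
}
RELATION_MAP = {"PRO": "support", "CON": "attack"}

def to_relation_frame(df):
    """Keep the three task columns, rename them, and translate the labels."""
    out = df.loc[:, list(COLUMN_MAP)].rename(columns=COLUMN_MAP)
    # map() yields NaN for any label missing from RELATION_MAP
    out["relation"] = out["relation"].map(RELATION_MAP)
    if out["relation"].isna().any():
        raise ValueError("unexpected stance label in 'claims.stance'")
    return out

# Toy rows (hypothetical, not taken from the IBM dataset)
toy = pd.DataFrame({
    "topicText": ["Topic X should be adopted", "Topic X should be adopted"],
    "claims.claimCorrectedText": ["X has benefits", "X causes harm"],
    "claims.stance": ["PRO", "CON"],
    "topicTarget": ["Topic X", "Topic X"],
})

relations = to_relation_frame(toy)
print(relations)
```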

In [None]:
# Save data to csv
clean_df.to_csv("./ibm_claim_stance_binary_sample_100.csv")
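
Called as above, `to_csv` also writes the DataFrame index as an unnamed first column. Whether the liaisons-experiments template expects that column is not stated here, so the sketch below only illustrates the difference on a toy frame (hypothetical data, written to in-memory buffers rather than to disk).

```python
import io
import pandas as pd

# Toy frame with the same column names as clean_df (hypothetical content)
toy = pd.DataFrame({
    "argument_a": ["Topic X should be adopted"],
    "argument_b": ["X has benefits"],
    "relation": ["support"],
})

# Default behaviour: the index is written as an unnamed leading column
with_index = io.StringIO()
toy.to_csv(with_index)

# index=False omits the index column
without_index = io.StringIO()
toy.to_csv(without_index, index=False)

header_with_index = with_index.getvalue().splitlines()[0]
header_without_index = without_index.getvalue().splitlines()[0]
print(header_with_index)     # ,argument_a,argument_b,relation
print(header_without_index)  # argument_a,argument_b,relation
```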

## Bibliography
- Bar-Haim, R., Bhattacharya, I., Dinuzzo, F., Saha, A., and Slonim, N. (2017). Stance Classification of Context-Dependent Claims. In *Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers*, pages 251–261.