# liaisons-preprocess

This notebook outlines the preprocessing steps undertaken to generate datasets used for testing the [liaisons](https://github.com/coding-kelps/liaisons) client and [benchmarking open-source models in predicting argument relations](https://github.com/coding-kelps/liaisons-experiments).

## Dataset

The source for this preprocessing task is IBM's "Claim Stance Dataset", which comprises 2,394 labeled claims on 55 controversial topics, collected from Wikipedia. Each claim is labeled with its stance towards the topic, either "PRO" (supporting the topic) or "CON" (opposing the topic), which makes the dataset well suited to relation-based argument mining tasks.

As of now, the dataset is available on [HuggingFace](https://huggingface.co/datasets/ibm/claim_stance). Further details on the original dataset can be found in the paper *Stance Classification of Context-Dependent Claims* (Bar-Haim et al., 2017).

## Preprocessing Notes

The primary aim of this preprocessing is to draw a small, representative sample of the dataset (about 100 claims, stratified by topic and stance) in order to reduce the computational cost of benchmarking and of testing related software.

In [None]:
from datasets import load_dataset

# Load dataset from HuggingFace (https://huggingface.co/datasets/ibm/claim_stance)
ds = load_dataset("ibm/claim_stance", "claim_stance")

In [None]:
# Convert the train split to a pandas DataFrame
df = ds["train"].to_pandas()

In [None]:
def stratified_sample(group, frac, keeper):
    """Sample about `frac` of `group`, preserving the proportions of the `keeper` column."""
    # Target number of rows for each unique value of `keeper`, rounded to the
    # nearest integer (small strata can round down to zero)
    counts = group[keeper].value_counts(normalize=True) * frac * len(group)
    counts = counts.round().astype(int)

    # Draw that many rows from each stratum
    group = group.groupby(keeper, group_keys=False).apply(lambda x: x.sample(n=counts[x.name]), include_groups=True)

    return group

# Fraction of rows to keep, targeting a sample of about 100 claims
frac = 100 / len(df)

# Apply stratified sampling within each topic, preserving the stance proportions
sampled_df = df.groupby("topicId").apply(lambda x: stratified_sample(x, frac, "claims.stance"), include_groups=True)

# Drop the group index left by the groupby
sampled_df = sampled_df.reset_index(drop=True)
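
The idea behind the sampling step can be illustrated on toy data. The following is a minimal sketch, not the notebook's own function: the `toy` frame, the `sample_stratum` helper, and the fixed `seed` are assumptions made for illustration. It draws the same fraction from every (topic, stance) stratum, so the stance ratio of the sample matches that of the full data.

```python
import pandas as pd

# Hypothetical toy data: 2 topics, each with 20 claims at a 3:1 PRO/CON ratio
toy = pd.DataFrame({
    "topicId": [1] * 20 + [2] * 20,
    "claims.stance": (["PRO"] * 15 + ["CON"] * 5) * 2,
})

def sample_stratum(stratum, frac, seed=0):
    # Number of rows to keep from this (topic, stance) stratum
    n = round(frac * len(stratum))
    # A fixed seed (an assumption here) makes the draw repeatable
    return stratum.sample(n=n, random_state=seed)

frac = 0.2
parts = [
    sample_stratum(stratum, frac)
    for _, stratum in toy.groupby(["topicId", "claims.stance"])
]
toy_sample = pd.concat(parts, ignore_index=True)

# Stance proportions before and after sampling
original_ratio = toy["claims.stance"].value_counts(normalize=True)
sampled_ratio = toy_sample["claims.stance"].value_counts(normalize=True)
print(original_ratio)
print(sampled_ratio)
```

Because each stratum's size is rounded to an integer, very small strata can contribute zero rows, so the total sample size on real data may differ somewhat from the target.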

In [None]:
# Extract the columns essential to the argument relation prediction task
clean_df = sampled_df.loc[:, ["topicText", "claims.claimCorrectedText", "claims.stance"]]

# Rename the columns to fit the liaisons-experiments data template
clean_df = clean_df.rename(columns={"topicText": "argument_a", "claims.claimCorrectedText": "argument_b", "claims.stance": "relation"})

# Rename relation value categories
clean_df['relation'] = clean_df['relation'].replace({"PRO": "support", "CON": "attack"})
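
The renaming and relabeling above can be checked on a couple of rows. This is a small sketch under stated assumptions: the rows in `toy` are invented, and it uses `Series.map` rather than the `replace` call from the cell above, because `map` yields `NaN` for any label missing from the mapping, which makes unexpected stance values easy to spot.

```python
import pandas as pd

# Hypothetical rows mimicking the three columns kept by the notebook
toy = pd.DataFrame({
    "topicText": ["This house would ban X", "This house would ban X"],
    "claims.claimCorrectedText": ["X is harmful", "X is harmless"],
    "claims.stance": ["PRO", "CON"],
})

column_map = {
    "topicText": "argument_a",
    "claims.claimCorrectedText": "argument_b",
    "claims.stance": "relation",
}
label_map = {"PRO": "support", "CON": "attack"}

renamed = toy.rename(columns=column_map)
# Labels absent from label_map become NaN instead of passing through unchanged
renamed["relation"] = renamed["relation"].map(label_map)

# Count rows whose stance label was not covered by the mapping
unmapped = int(renamed["relation"].isna().sum())
print(renamed)
print("unmapped labels:", unmapped)
```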

In [None]:
# Save data to csv
clean_df.to_csv("./ibm_claim_stance_sample_100.csv")
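
When the file is read back, note that `to_csv` writes the DataFrame index as an unnamed first column by default. The sketch below, which uses invented rows and a temporary file rather than the real sample, shows one way to load such a file without picking up that extra column.

```python
import os
import tempfile

import pandas as pd

# Hypothetical rows in the liaisons-experiments column layout
toy = pd.DataFrame({
    "argument_a": ["This house would ban X", "This house would ban X"],
    "argument_b": ["X is harmful", "X is harmless"],
    "relation": ["support", "attack"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_sample.csv")
    # Same call style as above: the index is written as the first column
    toy.to_csv(path)

    # Reading naively exposes the index as an "Unnamed: 0" column
    raw = pd.read_csv(path)
    # index_col=0 treats that first column as the index again
    restored = pd.read_csv(path, index_col=0)

print(list(raw.columns))
print(list(restored.columns))
```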

## Bibliography
- Bar-Haim, R., Bhattacharya, I., Dinuzzo, F., Saha, A., and Slonim, N. (2017). Stance Classification of Context-Dependent Claims. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 251–261.