# liaisons-experiments - Large Language Model Benchmarking for Relation-based Argument Mining

This notebook evaluates how well a range of Large Language Models perform micro-scale Relation-based Argument Mining tasks.

## About the Task

This work is a modest continuation of previous work (Gorur et al., 2024) that limits computing cost by greatly reducing the size of the dataset.

The task consists of measuring each model's ability to predict the logical relation between two arguments on controversial topics collected from Wikipedia (Bar-Haim et al., 2017).  
The predicted relation belongs to one of 2 or 3 classes, depending on the relation dimension configuration:
- When *binary*, a relation is either *support* (e.g., "Arg A logically supports Arg B") or *attack* (e.g., "Arg A logically contradicts Arg B")
- When *ternary*, a relation is either *support* (e.g., "Arg A logically supports Arg B"), *attack* (e.g., "Arg A logically contradicts Arg B"), or *unrelated* (e.g., "Arg A is logically irrelevant to Arg B")  
  
For example, the first argument `ASEAN has subscribed to the notion of democratic peace` stands in an `attack` relation to the second argument `This house would disband ASEAN`.

In [None]:
from dotenv import load_dotenv

# Load the .env file to safely retrieve the Hugging Face token for the task dataset
# and the API keys for the platform-hosted LLMs
load_dotenv()

## Selected Models

Capitalizing on the growing trend of open-source LLMs, this research investigates models like phi3, gemma2, and llama3 that are accessible even to users without specialized hardware.

Expanding the evaluation, this work also includes larger, platform-hosted models (e.g., gpt-3.5-turbo-0125, gemini-1.5-pro). Their ease of scaling makes them particularly attractive for developing macro-scale argument mining features later on.

### Hyperparameters Configuration

Following a previous hyperparameter search (Gorur et al., 2024), `temperature` and `top_p` are set to 0.7 and 1 respectively for better results. The `max_tokens` hyperparameter is set to the minimum needed to generate the expected classes ("support"/"attack"/"unrelated"), which substantially cuts computation cost. However, this minimum depends on the tokenizer used by each model, so the value varies across models (for example, 2 for the OpenAI models and 3 for the Anthropic models in the configuration below).
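
The dependence of `max_tokens` on the tokenizer can be illustrated with a small sketch. The two tokenizers below are toy stand-ins invented for illustration, not the tokenizers of any model listed in this notebook; the point is only that the smallest usable budget is the token count of the longest label.

```python
from typing import Callable, Sequence

LABELS = ("support", "attack", "unrelated")


def min_token_budget(labels: Sequence[str], tokenize: Callable[[str], list[str]]) -> int:
    """Smallest max_tokens value that lets the model emit any label in full."""
    return max(len(tokenize(label)) for label in labels)


def toy_whole_word_tokenizer(text: str) -> list[str]:
    # Toy assumption: every label maps to a single token.
    return [text]


def toy_chunk_tokenizer(text: str, size: int = 4) -> list[str]:
    # Toy assumption: text is split into fixed-size chunks of `size` characters.
    return [text[i:i + size] for i in range(0, len(text), size)]


print(min_token_budget(LABELS, toy_whole_word_tokenizer))  # 1
print(min_token_budget(LABELS, toy_chunk_tokenizer))       # 3
```

With a real model, the same function could be given that model's own tokenizer in place of the toy ones.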

### Pipeline Acceleration

Taking advantage of the infrastructure behind platform-hosted models, the benchmarking framework offers a multithreading feature, configurable through the `num_workers` parameter, which substantially speeds up the benchmark.
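
A minimal sketch of how such a worker pool can be built with the standard library is shown below. It assumes the framework fans prompts out over threads; the function `run_prompts` and the stand-in `fake_llm_call` are hypothetical, and the actual `MultiExperiment` implementation may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_prompts(prompts: list[str], call_model: Callable[[str], str], num_workers: int) -> list[str]:
    """Send prompts to the model concurrently, returning responses in input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map preserves the order of the input iterable.
        return list(pool.map(call_model, prompts))


def fake_llm_call(prompt: str) -> str:
    # Stand-in for a network-bound request to a hosted model.
    return "support" if "agree" in prompt else "attack"


responses = run_prompts(["I agree", "I object"], fake_llm_call, num_workers=2)
print(responses)  # ['support', 'attack']
```

Threads suit this workload because each call mostly waits on network I/O; a lower `num_workers` may be preferable when a provider enforces strict rate limits.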


In [None]:
from liaisons_experiments.experiments.multi_experiment import MultiExperiment
from tqdm.notebook import tqdm
from langchain_community.chat_models import ChatOllama
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from langchain_google_genai import GoogleGenerativeAI, HarmCategory, HarmBlockThreshold
import os

self_hosted_llms = [
    (ChatOllama(
        model="llama3:8b",
        temperature=0.7,
        max_tokens=2,
        top_p=1,
    ), {
        "num_workers": 5
    }),
    (ChatOllama(
        model="phi3:3.8b",
        temperature=0.7,
        max_tokens=2,
        top_p=1,
    ), {
        "num_workers": 10
    }),
    (ChatOllama(
        model="phi3:14b",
        temperature=0.7,
        max_tokens=2,
        top_p=1,
    ), {
        "num_workers": 3
    }),
    (ChatOllama(
        model="gemma:2b",
        temperature=0.7,
        max_tokens=2,
        top_p=1,
    ), {
        "num_workers": 10
    }),
    (ChatOllama(
        model="gemma:2b-it",
        temperature=0.7,
        max_tokens=2,
        top_p=1,
    ), {
        "num_workers": 10
    }),
    (ChatOllama(
        model="gemma:7b",
        temperature=0.7,
        max_tokens=2,
        top_p=1,
    ), {
        "num_workers": 5
    }),
    (ChatOllama(
        model="gemma:7b-it",
        temperature=0.7,
        max_tokens=2,
        top_p=1,
    ), {
        "num_workers": 5
    }),
    (ChatOllama(
        model="gemma2:2b",
        temperature=0.7,
        max_tokens=2,
        top_p=1,
    ), {
        "num_workers": 10
    }),
    (ChatOllama(
        model="gemma2:2b-it",
        temperature=0.7,
        max_tokens=2,
        top_p=1,
    ), {
        "num_workers": 10
    }),
    (ChatOllama(
        model="gemma2:9b",
        temperature=0.7,
        max_tokens=2,
        top_p=1,
    ), {
        "num_workers": 5
    }),
    (ChatOllama(
        model="gemma2:9b-it",
        temperature=0.7,
        max_tokens=2,
        top_p=1,
    ), {
        "num_workers": 5
    }),
    (ChatOllama(
        model="gemma2:27b",
        temperature=0.7,
        max_tokens=2,
        top_p=1,
    ), {
        "num_workers": 2
    }),
    (ChatOllama(
        model="gemma2:27b-it",
        temperature=0.7,
        max_tokens=2,
        top_p=1,
    ), {
        "num_workers": 2
    }),
]

platform_hosted_llms = [
    (ChatOpenAI(
        model="gpt-3.5-turbo-0125",
        temperature=0.7,
        max_tokens=2,
        top_p=1,
        api_key=os.environ["LIAISONS_EXPERIMENTS_OPENAI_API_KEY"],
    ), {
        "num_workers": 16,
    }),
    (ChatOpenAI(
        model="gpt-4-turbo-2024-04-09",
        temperature=0.7,
        max_tokens=2,
        top_p=1,
        api_key=os.environ["LIAISONS_EXPERIMENTS_OPENAI_API_KEY"],
    ), {
        "num_workers": 1,
    }),
    (ChatOpenAI(
        model="gpt-4o-2024-05-13",
        temperature=0.7,
        max_tokens=2,
        top_p=1,
        api_key=os.environ["LIAISONS_EXPERIMENTS_OPENAI_API_KEY"],
    ), {
        "num_workers": 1,
    }),
    (ChatOpenAI(
        model="gpt-4o-mini-2024-07-18",
        temperature=0.7,
        max_tokens=2,
        top_p=1,
        api_key=os.environ["LIAISONS_EXPERIMENTS_OPENAI_API_KEY"],
    ), {
        "num_workers": 16,
    }),
    (ChatAnthropic(
        model="claude-3-haiku-20240307",
        temperature=0.7,
        max_tokens=3,
        top_p=1,
        api_key=os.environ["LIAISONS_EXPERIMENTS_ANTHROPIC_API_KEY"],
    ),{
        "num_workers": 2,
    }),
    (ChatAnthropic(
        model="claude-3-sonnet-20240229",
        temperature=0.7,
        max_tokens=3,
        top_p=1,
        api_key=os.environ["LIAISONS_EXPERIMENTS_ANTHROPIC_API_KEY"],
    ),{
        "num_workers": 2,
    }),
    (ChatAnthropic(
        model="claude-3-opus-20240229",
        temperature=0.7,
        max_tokens=3,
        top_p=1,
        api_key=os.environ["LIAISONS_EXPERIMENTS_ANTHROPIC_API_KEY"],
    ),{
        "num_workers": 1,
    }),
    (ChatAnthropic(
        model="claude-3-5-sonnet-20240620",
        temperature=0.7,
        max_tokens=3,
        top_p=1,
        api_key=os.environ["LIAISONS_EXPERIMENTS_ANTHROPIC_API_KEY"],
    ),{
        "num_workers": 2,
    }),
    (GoogleGenerativeAI(
        model="gemini-1.5-flash",
        temperature=0.7,
        max_output_tokens=2,
        top_p=1,
        safety_settings={
            HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
            HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
            HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
        },
        google_api_key=os.environ["LIAISONS_EXPERIMENTS_GOOGLE_API_KEY"],
    ),{
        "num_workers": 16,
    }),
    (GoogleGenerativeAI(
        model="gemini-1.5-pro",
        temperature=0.7,
        max_output_tokens=2,
        top_p=1,
        safety_settings={
            HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
            HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
            HarmCategory.HARM_CATEGORY_HATE_SPEECH: HarmBlockThreshold.BLOCK_NONE,
        },
        google_api_key=os.environ["LIAISONS_EXPERIMENTS_GOOGLE_API_KEY"],
    ),{
        "num_workers": 5,
    }),
]

exps = MultiExperiment([*self_hosted_llms, *platform_hosted_llms], tqdm=tqdm)

In [None]:
from datasets import load_dataset

hf_token = os.environ.get("LIAISONS_HUGGING_FACE_API_KEY")

dataset = load_dataset("coding-kelps/liaisons-claim-stance-sample", token=hf_token)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

def plot_binary_results(binary_results, title: str | None = None):
    binary_plot_results = pd.merge(binary_results.f1_scores, binary_results.metadata) \
        .melt(id_vars='model_name', var_name='Metric', value_name='Value')

    # Set the size of the plot
    plt.figure(figsize=(14, 8))

    # Define a list of colors for the palette
    colors = ["#1F77B4", "#FF7F0F", "#2BA02B", "#D62727"]

    # Create a grouped bar plot
    ax = sns.barplot(data=binary_plot_results, x='model_name', y='Value', hue='Metric', palette=colors)

    plt.title(title)
    plt.xlabel("Model Name")
    plt.ylabel("Benchmarks")

    # Fix the tick positions first; otherwise the labels may end up in unexpected positions
    # https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_xticklabels.html
    ax.set_xticks(ax.get_xticks())
    # Rotate labels and align to the right
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")

    # Show the plot
    plt.tight_layout()
    plt.show()

def plot_ternary_results(ternary_results, title: str | None = None):
    ternary_plot_results = pd.merge(ternary_results.f1_scores, ternary_results.metadata) \
        .melt(id_vars='model_name', var_name='Metric', value_name='Value')
    
    # Set the size of the plot
    plt.figure(figsize=(14, 8))
    
    
    # Define a list of colors for the palette
    colors = ["#1F77B4", "#FF7F0F", "#9467BD", "#2BA02B", "#D62727"]
    
    # Create a grouped bar plot
    ax = sns.barplot(data=ternary_plot_results, x='model_name', y='Value', hue='Metric', palette=colors)
    
    plt.title(title)
    plt.xlabel("Model Name")
    plt.ylabel("Benchmarks")
    
    # Fix the tick positions first; otherwise the labels may end up in unexpected positions
    # https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set_xticklabels.html
    ax.set_xticks(ax.get_xticks())
    # Rotate labels and align to the right
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
    
    # Show the plot
    plt.tight_layout()
    plt.show()

## Prompting Techniques

This benchmark builds upon previous research by Gorur et al. (2024), which uses "few-shot" prompting. This technique provides a few examples of the desired behavior (four in this notebook) before presenting the actual prompt.  

However, a significant discrepancy emerged between prior results and our findings. To address this gap, we also explored "augmented few-shot" prompting, which incorporates an additional instructional line within the prompt.

In [None]:
def binary_ibm_few_shot_prompting(parent_argment: str, child_argument: str) -> str:
    prompt = f"""
Arg1: Even in the case of provocateurs, it can be an effective strategy to call their bluff, by offering them a chance to have a rational conversation. In this case, the failure to do so is their responsibility alone.
Arg2: No-platforming hinders productive discourse.
Relation: attack

Arg1: A country used to receiving ODA may be perpetually bound to depend on handouts (pp. 197).
Arg2: Government structures adapt to handle and distribute incoming ODA. As the funding from ODA is significant, countries have vested bureaucratic interest to remain bound to aid (pp. 197).
Relation: support

Arg1: Elections would limit the influence of lobbyists on the appointment of Supreme Court judges.
Arg2: The more individuals take part in a decision, as would be the case in a popular vote compared to a vote in the Senate, the harder it is to sway the outcome.
Relation: support

Arg1: ChatGPT will reach AGI level before 2030.
Arg2: To reach AGI it should be able to generate its own goals and intentions: where would it draw these from?
Relation: attack

Arg1: {parent_argment}
Arg2: {child_argument}
Relation: 
"""
    
    return prompt

def binary_ibm_augmented_few_shot_prompting(parent_argment: str, child_argument: str) -> str:
    prompt = f"""
Arg1: Even in the case of provocateurs, it can be an effective strategy to call their bluff, by offering them a chance to have a rational conversation. In this case, the failure to do so is their responsibility alone.
Arg2: No-platforming hinders productive discourse.
Relation: attack

Arg1: A country used to receiving ODA may be perpetually bound to depend on handouts (pp. 197).
Arg2: Government structures adapt to handle and distribute incoming ODA. As the funding from ODA is significant, countries have vested bureaucratic interest to remain bound to aid (pp. 197).
Relation: support

Arg1: Elections would limit the influence of lobbyists on the appointment of Supreme Court judges.
Arg2: The more individuals take part in a decision, as would be the case in a popular vote compared to a vote in the Senate, the harder it is to sway the outcome.
Relation: support

Arg1: ChatGPT will reach AGI level before 2030.
Arg2: To reach AGI it should be able to generate its own goals and intentions: where would it draw these from?
Relation: attack

---

What the relation between Arg1 and Arg2, respond with one word: support or attack:

Arg1: {parent_argment}
Arg2: {child_argument}
Relation: 
"""
    
    return prompt

In [None]:
binary_df = dataset['binary'].to_pandas()

In [None]:
few_shot_binary_results = exps.run_from_df(binary_df, binary_ibm_few_shot_prompting, relation_dim="binary")

plot_binary_results(few_shot_binary_results, title="Large Language Models for binary argumentative relation prediction over the IBM Debater preprocessed dataset sample using few shot prompting")

In [None]:
augmented_few_shot_binary_results = exps.run_from_df(binary_df, binary_ibm_augmented_few_shot_prompting, relation_dim="binary")

plot_binary_results(augmented_few_shot_binary_results, title="Large Language Models for binary argumentative relation prediction over the IBM Debater preprocessed dataset sample using augmented few shot prompting")

In [None]:
def ternary_ibm_few_shot_prompting(parent_argment: str, child_argument: str) -> str:
    prompt = f"""
Arg1: Even in the case of provocateurs, it can be an effective strategy to call their bluff, by offering them a chance to have a rational conversation. In this case, the failure to do so is their responsibility alone.
Arg2: No-platforming hinders productive discourse.
Relation: attack

Arg1: ChatGPT will reach AGI level before 2030.
Arg2: Government structures adapt to handle and distribute incoming ODA. As the funding from ODA is significant, countries have vested bureaucratic interest to remain bound to aid (pp. 197).
Relation: unrelated

Arg1: Elections would limit the influence of lobbyists on the appointment of Supreme Court judges.
Arg2: The more individuals take part in a decision, as would be the case in a popular vote compared to a vote in the Senate, the harder it is to sway the outcome.
Relation: support

Arg1: A country used to receiving ODA may be perpetually bound to depend on handouts (pp. 197).
Arg2: To reach AGI it should be able to generate its own goals and intentions: where would it draw these from?
Relation: unrelated

Arg1: {parent_argment}
Arg2: {child_argument}
Relation: 
"""
    
    return prompt

def ternary_ibm_augmented_few_shot_prompting(parent_argment: str, child_argument: str) -> str:
    prompt = f"""
Arg1: Even in the case of provocateurs, it can be an effective strategy to call their bluff, by offering them a chance to have a rational conversation. In this case, the failure to do so is their responsibility alone.
Arg2: No-platforming hinders productive discourse.
Relation: attack

Arg1: ChatGPT will reach AGI level before 2030.
Arg2: Government structures adapt to handle and distribute incoming ODA. As the funding from ODA is significant, countries have vested bureaucratic interest to remain bound to aid (pp. 197).
Relation: unrelated

Arg1: Elections would limit the influence of lobbyists on the appointment of Supreme Court judges.
Arg2: The more individuals take part in a decision, as would be the case in a popular vote compared to a vote in the Senate, the harder it is to sway the outcome.
Relation: support

Arg1: A country used to receiving ODA may be perpetually bound to depend on handouts (pp. 197).
Arg2: To reach AGI it should be able to generate its own goals and intentions: where would it draw these from?
Relation: unrelated

---

What the relation between Arg1 and Arg2, respond with one word: support, attack, or unrelated:

Arg1: {parent_argment}
Arg2: {child_argument}
Relation: 
"""
    
    return prompt

In [None]:
ternary_df = dataset['ternary'].to_pandas()

In [None]:
few_shot_ternary_results = exps.run_from_df(ternary_df, ternary_ibm_few_shot_prompting, relation_dim="ternary")

plot_ternary_results(few_shot_ternary_results, title="Large Language Models for ternary argumentative relation prediction over the IBM Debater preprocessed dataset sample using few shot prompting")

In [None]:
augmented_few_shot_ternary_results = exps.run_from_df(ternary_df, ternary_ibm_augmented_few_shot_prompting, relation_dim="ternary")

plot_ternary_results(augmented_few_shot_ternary_results, title="Large Language Models for ternary argumentative relation prediction over the IBM Debater preprocessed dataset sample using augmented few shot prompting")

## Bibliography
- Bar-Haim, R., Bhattacharya, I., Dinuzzo, F., Saha, A., and Slonim, N. (2017). Stance classification of context dependent claims. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 251–261.
- Gorur, D., Rago, A., and Toni, F. (2024). Can Large Language Models perform Relation-based Argument Mining? arXiv preprint arXiv:2402.11243. https://doi.org/10.48550/arXiv.2402.11243