# Using LLMs as annotators with the MultiPiCo dataset from LeWiDi 2025
The goal of this experiment is to find out whether our selected large language model, DeepSeek-V3, can capture annotator disagreement. We first use Zero-Shot prompting, then Few-Shot prompting to investigate how well the model learns from examples. The data comes from the third edition of the Learning With Disagreements (LeWiDi) shared task, part of SemEval. The MultiPiCo dataset contains annotated data for an irony detection task. DeepSeek-V3 is asked to provide a soft label distribution over a binary classification (1 = ironic, 0 = non-ironic).

# Required installations

In [10]:
%pip install openai



# Loading the dataset

In [11]:
import json
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [12]:
file_path = '/content/drive/My Drive/data_practice_phase/data_practice_phase/MP/MP_train.json'
with open(file_path, 'r') as file:
  data = json.load(file)

In [13]:
import pandas as pd
df = pd.DataFrame(data)
df = df.transpose()
# For this experiment, we will use Dutch annotation tasks only.
dutch_df = df[df['lang'] == 'nl']
dutch_df.head()

| | annotation task | text | annotators | number of annotations | annotations | soft_label | split | lang | other_info |
|---|---|---|---|---|---|---|---|---|---|
| 8134 | irony detection | {'post': 'De dood is minder dan het geraamte, ... | Ann215,Ann216,Ann226,Ann232,Ann236 | 5 | {'Ann215': '0', 'Ann216': '0', 'Ann226': '0', ... | {'0.0': 1.0, '1.0': 0} | train | nl | {'source': 'twitter', 'level': 1.0, 'language_... |
| 8135 | irony detection | {'post': 'Die vraag of leeftijd was wel een be... | Ann215,Ann220,Ann225,Ann230,Ann237 | 5 | {'Ann215': '0', 'Ann220': '0', 'Ann225': '0', ... | {'0.0': 1.0, '1.0': 0} | train | nl | {'source': 'reddit', 'level': 1.0, 'language_v... |
| 8143 | irony detection | {'post': '@USER Gefeliciteerd!', 'reply': '@US... | Ann215,Ann216,Ann225,Ann233,Ann235 | 5 | {'Ann215': '0', 'Ann216': '0', 'Ann225': '0', ... | {'0.0': 1.0, '1.0': 0} | train | nl | {'source': 'twitter', 'level': 2.0, 'language_... |
| 8144 | irony detection | {'post': 'Gezamenlijke maaltijden zijn ontzett... | Ann215,Ann218,Ann219,Ann223,Ann228 | 5 | {'Ann215': '1', 'Ann218': '1', 'Ann219': '0', ... | {'0.0': 0.4, '1.0': 0.6} | train | nl | {'source': 'twitter', 'level': 1.0, 'language_... |
| 8146 | irony detection | {'post': 'Tegenwoordig hebben veel mensen een ... | Ann215,Ann218,Ann224,Ann230,Ann231 | 5 | {'Ann215': '0', 'Ann218': '0', 'Ann224': '0', ... | {'0.0': 1.0, '1.0': 0} | train | nl | {'source': 'reddit', 'level': 1.0, 'language_v... |
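
The `soft_label` column appears to be the share of annotators voting for each label (for example, three of five votes for "ironic" gives 0.6). The following is a minimal sketch of that relationship, assuming the votes are stored as strings `'0'` and `'1'` as in the `annotations` column; the helper name `soft_label_from_annotations` and the annotator IDs in the example are hypothetical.

```python
from collections import Counter


def soft_label_from_annotations(annotations):
    """Turn a dict of per-annotator votes into a distribution over labels 0 and 1."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    # Votes arrive as strings ('0' or '1'), so convert them before counting.
    votes = Counter(int(vote) for vote in annotations.values())
    total = sum(votes.values())
    # Use the same keys as the dataset's soft_label column.
    return {"0.0": votes[0] / total, "1.0": votes[1] / total}


# Hypothetical votes: annotator IDs and values are made up for illustration.
example_votes = {"A1": "1", "A2": "1", "A3": "0", "A4": "1", "A5": "0"}
print(soft_label_from_annotations(example_votes))  # {'0.0': 0.4, '1.0': 0.6}
```

Recomputing the distribution this way can serve as a sanity check on the loaded data before comparing it with model output.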


# Loading DeepSeek-V3

In [14]:
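import os
from openai import OpenAI

# Read the API key from the environment instead of hardcoding it in the notebook.
# DEEPSEEK_API_KEY is an assumed variable name; set it before running this cell.
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")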

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Hi!"},
    ],
    stream=False
)
print(response.choices[0].message.content)

Hello! 😊 How can I assist you today?


# Prompting techniques

We start with Zero-Shot prompting: we pass the LLM a post-reply message and ask it for a soft label distribution.

In [16]:
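def zero_shot_prompt(text, system_prompt):
    """Ask DeepSeek-V3 for a soft label distribution for one post-reply message; returns the raw JSON string."""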
    prompt = f"""
    Analyze the following post-reply message whether it is ironic.
    Provide a soft label distribution for whether the reply to the post is ironic (1) or non-ironic (0).
    Text: "{text}"
    Requirements for the Output format:
    A JSON object with keys "ironic" and "non_ironic", values must be between 0 and 1.
    The values must sum up to 1.
    Example Output:
    {{
      "ironic":0.6,
      "non_ironic":0.4
    }}

    Analyze the provided text and return the output as mentioned above.
    """
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        stream=False
    )
    return response.choices[0].message.content

# The index labels are strings (JSON keys), so select the first row by position with .iloc.
first_text = dutch_df['text'].iloc[0]
print(zero_shot_prompt(first_text, "You are an assistant that annotates post-reply messages on irony"), first_text)


{
  "ironic": 0.3,
  "non_ironic": 0.7
} {'post': 'De dood is minder dan het geraamte, het leven…..', 'reply': '@USER Het is de jas die het geraamte omhuld en het karakter wat levenslust naar buiten brengt.'}
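
The model's reply is a JSON string, so it needs to be parsed and checked before it can be compared with the gold `soft_label`. Below is a minimal sketch of that step, using the reply shown above and the gold distribution of the same item. The helper names `parse_soft_label` and `manhattan_distance` are hypothetical, and Manhattan distance is only one possible way to measure the gap between two distributions, not necessarily the shared task's official metric.

```python
import json


def parse_soft_label(raw):
    """Parse the model's JSON reply into a validated distribution keyed like the dataset."""
    parsed = json.loads(raw)
    ironic = float(parsed["ironic"])
    non_ironic = float(parsed["non_ironic"])
    if not (0.0 <= ironic <= 1.0 and 0.0 <= non_ironic <= 1.0):
        raise ValueError(f"probabilities out of range: {parsed}")
    total = ironic + non_ironic
    if total == 0:
        raise ValueError("both probabilities are zero")
    # Renormalise in case the two values do not sum exactly to 1.
    return {"0.0": non_ironic / total, "1.0": ironic / total}


def manhattan_distance(pred, gold):
    """Sum of absolute differences between two distributions over the labels 0 and 1."""
    return sum(abs(pred[key] - float(gold[key])) for key in ("0.0", "1.0"))


raw_reply = '{"ironic": 0.3, "non_ironic": 0.7}'  # the reply printed above
gold = {"0.0": 1.0, "1.0": 0}                     # soft_label of the same item
pred = parse_soft_label(raw_reply)
print(pred, round(manhattan_distance(pred, gold), 3))
```

A distance of 0 would mean the predicted distribution matches the annotators exactly; for two-label distributions the maximum is 2.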


Next, we use Few-Shot prompting: we give the LLM three annotated examples followed by the message, and ask it for a soft label distribution.
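
One common way to supply such examples is as alternating user and assistant turns ahead of the real query. The sketch below illustrates that pattern only and is independent of the notebook's own `few_shot_prompt`; the helper name `build_few_shot_messages`, the example texts, and their labels are all hypothetical. It builds the message list without calling the API.

```python
import json


def build_few_shot_messages(examples, text, system_prompt):
    """Build a chat message list: system prompt, worked examples, then the target text."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_text, soft_label in examples:
        # Each example is shown as a user turn followed by the desired assistant answer.
        messages.append({"role": "user", "content": f'Text: "{example_text}"'})
        answer = {"ironic": soft_label["1.0"], "non_ironic": soft_label["0.0"]}
        messages.append({"role": "assistant", "content": json.dumps(answer)})
    # The message to annotate comes last, in the same format as the examples.
    messages.append({"role": "user", "content": f'Text: "{text}"'})
    return messages


# Hypothetical examples: texts and distributions are made up for illustration.
examples = [
    ("example post-reply 1", {"0.0": 1.0, "1.0": 0.0}),
    ("example post-reply 2", {"0.0": 0.4, "1.0": 0.6}),
    ("example post-reply 3", {"0.0": 0.2, "1.0": 0.8}),
]
messages = build_few_shot_messages(
    examples,
    "message to annotate",
    "You are an assistant that annotates post-reply messages on irony. Answer in JSON.",
)
print(len(messages))  # 8: one system turn, three example pairs, one query
```

The resulting list could be passed as the `messages` argument of `client.chat.completions.create`, in the same way as in the Zero-Shot cell.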

In [None]:
def few_shot_prompt