<a href="https://colab.research.google.com/github/dylan5lin/MP_AnnotatorDisagreement/blob/main/LLM_as_annotator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Using LLMs as annotators on the MultiPiCo dataset from LeWiDi 2025
The goal of this experiment is to find out whether our selected large language model, DeepSeek-V3, can capture annotator disagreement. We first use zero-shot prompting and then few-shot prompting, to investigate the model's ability to learn from a handful of labelled examples. The data come from the third edition of the Learning With Disagreements (LeWiDi) shared task, which is part of SemEval. The MultiPiCo dataset contains annotations for an irony detection task. DeepSeek-V3 is asked to return a soft label distribution over a binary classification (1 = ironic, 0 = non-ironic).
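
To make the notion of a soft label concrete, the sketch below derives one from per-annotator binary labels as the share of annotators choosing each class. The function name `soft_label_from_annotations` and the example annotator IDs are hypothetical, and whether the dataset's own soft labels were computed in exactly this way is an assumption.

```python
from collections import Counter

def soft_label_from_annotations(annotations):
    """Turn per-annotator binary labels into a soft label distribution."""
    # Count how often each label ("0" or "1") was assigned.
    counts = Counter(str(label) for label in annotations.values())
    total = sum(counts.values())
    return {"0": counts.get("0", 0) / total, "1": counts.get("1", 0) / total}

# Hypothetical item: five annotators, two of whom judged the reply ironic.
example_annotations = {"A1": "0", "A2": "1", "A3": "0", "A4": "1", "A5": "0"}
print(soft_label_from_annotations(example_annotations))  # {'0': 0.6, '1': 0.4}
```

A distribution of 0.6/0.4 therefore signals real disagreement, whereas 1.0/0.0 means that all annotators agreed.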

# Required installations

In [7]:
%pip install openai



# Loading the dataset

In [8]:
import json
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [9]:
file_path = '/content/drive/My Drive/data_practice_phase/data_practice_phase/MP/MP_train.json'
with open(file_path, 'r') as file:
  data = json.load(file)

In [16]:
import pandas as pd
df = pd.DataFrame(data)
df = df.transpose()
# For this experiment, we will use Dutch annotation tasks only.
dutch_df = df[df['lang'] == 'nl']
print(dutch_df.head())
print(len(dutch_df))

      annotation task                                               text  \
8134  irony detection  {'post': 'De dood is minder dan het geraamte, ...   
8135  irony detection  {'post': 'Die vraag of leeftijd was wel een be...   
8143  irony detection  {'post': '@USER Gefeliciteerd!', 'reply': '@US...   
8144  irony detection  {'post': 'Gezamenlijke maaltijden zijn ontzett...   
8146  irony detection  {'post': 'Tegenwoordig hebben veel mensen een ...   

                              annotators number of annotations  \
8134  Ann215,Ann216,Ann226,Ann232,Ann236                     5   
8135  Ann215,Ann220,Ann225,Ann230,Ann237                     5   
8143  Ann215,Ann216,Ann225,Ann233,Ann235                     5   
8144  Ann215,Ann218,Ann219,Ann223,Ann228                     5   
8146  Ann215,Ann218,Ann224,Ann230,Ann231                     5   

                                            annotations  \
8134  {'Ann215': '0', 'Ann216': '0', 'Ann226': '0', ...   
8135  {'Ann215': '0', 'Ann22

# Loading DeepSeek-V3

In [11]:
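import os
from openai import OpenAI

# Read the API key from the environment rather than hard-coding it in the notebook.
# The variable name DEEPSEEK_API_KEY is an assumption; use whichever name you exported.
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")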

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Hi!"},
    ],
    stream=False
)
print(response.choices[0].message.content)

Hello! 😊 How can I assist you today?


# Prompting techniques

We start with zero-shot prompting: the LLM receives the post-reply message and is asked to return a soft label distribution.

In [21]:
def zero_shot_prompt(text, system_prompt):
    """Ask the model for a soft label distribution for one post-reply item."""
    # The example output uses double quotes so that it is valid JSON,
    # in line with the JSON response format requested below.
    prompt = f"""
    Analyze the following post-reply message whether it is ironic.
    Provide a soft label distribution for whether the reply to the post is ironic (1) or non-ironic (0).
    Text: "{text}"
    Requirements for the Output format:
    A JSON object with keys "0"=non_ironic and "1"=ironic, values must be between 0 and 1.
    The values must sum up to 1.
    Example Output:
    {{"0": 0.6, "1": 0.4}}

    Analyze the provided text and return the output as mentioned above.
    """
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
        ],
        response_format={"type": "json_object"},
        stream=False,
    )
    # The reply is returned as a raw JSON string.
    return response.choices[0].message.content

# Use .iloc for positional access: the index holds item IDs, not 0..n-1.
print(zero_shot_prompt(dutch_df['text'].iloc[0], "You are an assistant that annotates post-reply messages on irony"), dutch_df['text'].iloc[0])


{"0":0.8, "1":0.2} {'post': 'De dood is minder dan het geraamte, het leven…..', 'reply': '@USER Het is de jas die het geraamte omhuld en het karakter wat levenslust naar buiten brengt.'}
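
Because the model returns its answer as a raw JSON string, it can help to check the reply before using it. The sketch below parses a reply, verifies the keys and value ranges stated in the prompt, and renormalises when the two values do not sum to 1. The helper name `parse_soft_label` and the tolerance `tol` are assumptions for illustration, not part of the original notebook.

```python
import json
import math

def parse_soft_label(raw, tol=1e-6):
    """Parse a model reply into a validated {'0': p0, '1': p1} distribution."""
    parsed = json.loads(raw)
    if set(parsed) != {"0", "1"}:
        raise ValueError(f"unexpected keys: {sorted(parsed)}")
    p0, p1 = float(parsed["0"]), float(parsed["1"])
    total = p0 + p1
    if min(p0, p1) < 0 or total <= 0:
        raise ValueError(f"invalid probabilities: {parsed}")
    # Renormalise when the values do not sum to 1 within the tolerance.
    if not math.isclose(total, 1.0, abs_tol=tol):
        p0, p1 = p0 / total, p1 / total
    return {"0": p0, "1": p1}

print(parse_soft_label('{"0":0.8, "1":0.2}'))  # {'0': 0.8, '1': 0.2}
```

Raising an error on unexpected keys, rather than silently skipping the item, keeps malformed replies visible when the full dataset is processed.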


Next, we use few-shot prompting: the LLM first sees three labelled examples and then the message to annotate, and is again asked to return a soft label distribution.

In [20]:
def few_shot_prompt(system_prompt, few_shot_examples, user_prompt):
    """Ask the model for a soft label distribution after showing labelled examples.

    few_shot_examples is a list of (user_message, assistant_message) pairs.
    """
    prompt = """
    Analyze the following post-reply message whether it is ironic.
    Provide a soft label distribution for whether the reply to the post is ironic (1) or non-ironic (0).
    Requirements for the Output format:
    A JSON object with keys "0"=non_ironic and "1"=ironic, values must be between 0 and 1.
    The values must sum up to 1.
    Example Output:
    {"0": 0.6, "1": 0.4}
    Analyze the provided text and return the output as mentioned above.
    """
    messages = [{"role": "system", "content": system_prompt}]
    # Each example is replayed as a user turn followed by the expected assistant turn.
    for user_message, assistant_message in few_shot_examples:
        messages.append({"role": "user", "content": f"{prompt}\n{user_message}"})
        messages.append({"role": "assistant", "content": assistant_message})
    # The f-string also accepts a dict, which plain string concatenation would reject.
    messages.append({"role": "user", "content": f"{prompt}\n{user_prompt}"})
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=messages,
        response_format={"type": "json_object"},
        stream=False,
    )
    return response.choices[0].message.content
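
The `few_shot_examples` argument is expected to be a list of `(user_message, assistant_message)` pairs. The sketch below shows one way such pairs could be built from labelled items. The helper name `build_few_shot_examples` and the placeholder texts are hypothetical, and it assumes the gold soft labels have already been converted to the keys "0" and "1" that the prompt asks for.

```python
import json

def build_few_shot_examples(labelled_items):
    """Format (text, soft_label) pairs as (user_message, assistant_message) tuples."""
    examples = []
    for text, soft_label in labelled_items:
        user_message = f'Text: "{text}"'
        # The assistant turn mirrors the JSON format requested in the prompt.
        assistant_message = json.dumps({"0": soft_label["0"], "1": soft_label["1"]})
        examples.append((user_message, assistant_message))
    return examples

# Hypothetical placeholders; in practice these would be three training items.
labelled_items = [
    ({"post": "<post 1>", "reply": "<reply 1>"}, {"0": 1.0, "1": 0.0}),
    ({"post": "<post 2>", "reply": "<reply 2>"}, {"0": 0.4, "1": 0.6}),
    ({"post": "<post 3>", "reply": "<reply 3>"}, {"0": 0.2, "1": 0.8}),
]
few_shot_examples = build_few_shot_examples(labelled_items)
print(few_shot_examples[1][1])  # {"0": 0.4, "1": 0.6}
```

Choosing examples with different degrees of agreement, rather than only unanimous ones, seems a reasonable way to show the model what a non-trivial distribution looks like, although this choice is untested here.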



# Running the LLM on the dataset with zero-shot and few-shot prompting

In [48]:
def run_prompt_zeroshot(df):
  # Copy the selection so that adding a column does not trigger SettingWithCopyWarning.
  results = df[['text', 'soft_label']].copy()
  system_prompt = "You are an assistant that annotates post-reply messages on irony"
  # Query the model once per row; the list order matches the row order.
  results['soft_label_prediction'] = [zero_shot_prompt(text, system_prompt) for text in results['text']]
  return results

# Positional slice: rows 1 to 9 of the Dutch subset.
zeroshot_res = run_prompt_zeroshot(dutch_df.iloc[1:10])

def run_prompt_fewshot(df):
  results = df[['text', 'soft_label']].copy()
  system_prompt = "You are an assistant that annotates post-reply messages on irony"
  # (user_message, assistant_message) pairs; still empty here, so add the labelled examples before running.
  few_shot_examples = []
  results['soft_label_prediction'] = [
      few_shot_prompt(system_prompt, few_shot_examples, f'Text: "{text}"') for text in results['text']
  ]
  # Return the frame; the original version returned None.
  return results


In [47]:
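def rename_keys(label, mapping):
  # Rename the keys that are present; labels that were already renamed
  # (for example when the cell is re-run) pass through unchanged.
  return {mapping.get(key, key): value for key, value in label.items()}

# The gold soft labels use the keys '0.0' and '1.0'; the model output uses '0' and '1'.
zeroshot_res['soft_label'] = [
    rename_keys(item, {'0.0': 'non_ironic', '1.0': 'ironic'}) for item in zeroshot_res['soft_label']
]
# Parse each raw JSON reply before renaming, so every row gets a prediction.
zeroshot_res['soft_label_prediction'] = [
    rename_keys(json.loads(item) if isinstance(item, str) else item, {'0': 'non_ironic', '1': 'ironic'})
    for item in zeroshot_res['soft_label_prediction']
]
print(zeroshot_res)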

                                                   text  \
8135  {'post': 'Die vraag of leeftijd was wel een be...   
8143  {'post': '@USER Gefeliciteerd!', 'reply': '@US...   
8144  {'post': 'Gezamenlijke maaltijden zijn ontzett...   
8146  {'post': 'Tegenwoordig hebben veel mensen een ...   
8147  {'post': 'Het is een puppet van Schwab. Het en...   
8148  {'post': 'Raheem Sterling is replaced by Pierr...   
8150  {'post': 'Ik ken het Dyslexie lettertype waar ...   
8151  {'post': 'Gedaan.  Als partner van een enorme ...   
8152  {'post': 'rellen =/= demonstreren. Als je je a...   

                              soft_label soft_label_prediction  \
8135    {'non_ironic': 1.0, 'ironic': 0}    {"0":0.9, "1":0.1}   
8143    {'non_ironic': 1.0, 'ironic': 0}    {"0":0.9, "1":0.1}   
8144  {'non_ironic': 0.4, 'ironic': 0.6}    {"0":0.3, "1":0.7}   
8146    {'non_ironic': 1.0, 'ironic': 0}    {"0":0.3, "1":0.7}   
8147  {'non_ironic': 0.2, 'ironic': 0.8}     {"0":0.3,"1":0.7}   
8148  {'non_i
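
To quantify how closely a prediction matches the annotators' distribution, one possible measure is the Manhattan (L1) distance between the two soft labels. The sketch below is an assumption about how the comparison could be done, not necessarily the shared task's official metric. It uses the values of item 8144 from the table above and assumes both labels use the keys `non_ironic` and `ironic`; the function name `manhattan_distance` is hypothetical.

```python
def manhattan_distance(gold, pred):
    """Sum of absolute differences between two soft label distributions."""
    keys = ("non_ironic", "ironic")
    return sum(abs(float(gold[key]) - float(pred[key])) for key in keys)

# Item 8144: gold soft label versus the zero-shot prediction.
gold = {"non_ironic": 0.4, "ironic": 0.6}
pred = {"non_ironic": 0.3, "ironic": 0.7}
print(round(manhattan_distance(gold, pred), 2))  # 0.2
```

For binary soft labels the distance ranges from 0 (identical distributions) to 2 (all probability mass on opposite classes), so lower values indicate that the model captures the disagreement better.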