
# Using LLMs as annotators on the MultiPiCo dataset from LeWiDi 2025
The goal of this experiment is to find out whether our selected large language model, DeepSeek-V3, can capture annotator disagreement. We first use zero-shot prompting, then apply few-shot prompting to see whether in-context examples improve its predictions. The data comes from the third edition of the Learning With Disagreements (LeWiDi) shared task, part of SemEval. The MultiPiCo dataset contains annotations for an irony detection task. DeepSeek-V3 is asked to output a soft label distribution over the two classes (1 = ironic, 0 = non-ironic).
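To make the target concrete, here is a minimal sketch of how a soft label can be derived from individual annotations. It assumes the soft label is simply the fraction of annotators choosing each class, which is consistent with the `{'0.0': 0.4, '1.0': 0.6}` entries shown later for items with five annotators. The function name `soft_label` and the example votes are made up for illustration.

```python
from collections import Counter

def soft_label(annotations):
    """Turn a list of binary annotations (0 or 1) into a soft label distribution."""
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations)
    total = len(annotations)
    # The string keys mirror the keys used in the dataset's soft_label field.
    return {"0.0": counts[0] / total, "1.0": counts[1] / total}

# Five hypothetical annotators: two say non-ironic (0), three say ironic (1).
print(soft_label([1, 0, 1, 1, 0]))  # prints {'0.0': 0.4, '1.0': 0.6}
```

The model is asked to reproduce this distribution directly, without seeing the individual votes.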

# Required installations

In [None]:
%pip install openai



# Loading the dataset

In [1]:
import json
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
file_path = '/content/drive/My Drive/data_practice_phase/data_practice_phase/MP/MP_dev.json'
with open(file_path, 'r') as file:
  data = json.load(file)

In [3]:
import pandas as pd
df = pd.DataFrame(data)
df = df.transpose()

# Loading DeepSeek-V3

In [4]:
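import os
from openai import OpenAI

# Read the API key from the environment instead of hardcoding it in the notebook.
# The variable name DEEPSEEK_API_KEY is a choice made here; set it before running this cell.
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")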

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Hi!"},
    ],
    stream=False
)
print(response.choices[0].message.content)

Hello! 😊 How can I assist you today?


# Prompting techniques

We start off with Zero-Shot prompting. We feed the LLM the message and ask it to provide a soft label distribution.

In [None]:
prompt_zs = f"""
    Analyze the following post-reply message whether it is ironic.
    The messages could be ironic or non-ironic. Provide a probability distribution for whether the reply to the post is ironic (1) or non-ironic (0).
    Requirements for the Output format:
    A JSON object with one key, "probabilities",
    the value represents a list of two elements. The first element being the value of non_ironic and the second element being ironic, values must be between 0 and 1.
    The values must sum up to 1.
    Example Output:
    {{"probabilities": [0.5, 0.5]}}

    Analyze the provided text and return the output as mentioned above.
    Text:
    """
def zero_shot_prompt(text, system_prompt, prompt):
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[{"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt + "\n" + str(text)}],
        response_format={"type": "json_object"},
        stream=False
    )
    return response.choices[0].message.content

# The index holds string ids, so select the first row by position with .iloc.
print(zero_shot_prompt(df['text'].iloc[0], "You are an assistant that annotates post-reply messages on irony", prompt_zs))
print(df['soft_label'].iloc[0])


{"probabilities": [0.3, 0.7]}
{'0.0': 0.4, '1.0': 0.6}
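The prompt states three requirements for the reply: one key, a list of two values between 0 and 1, and a sum of 1. A minimal sketch of checking these before a reply is stored could look as follows. The helper name `parse_distribution` and the tolerance are assumptions, not part of the original notebook.

```python
import json
import math

def parse_distribution(raw):
    """Parse a reply such as '{"probabilities": [0.3, 0.7]}' into [non_ironic, ironic]."""
    payload = json.loads(raw)
    if not isinstance(payload, dict) or len(payload) != 1:
        raise ValueError("expected a JSON object with exactly one key")
    # Accept whatever the single key is called ("probability" or "probabilities").
    (values,) = payload.values()
    if not isinstance(values, list) or len(values) != 2:
        raise ValueError("expected a list of two numbers")
    if any(not 0 <= v <= 1 for v in values):
        raise ValueError("each value must lie between 0 and 1")
    if not math.isclose(sum(values), 1.0, abs_tol=1e-6):
        raise ValueError("values must sum to 1")
    return [float(v) for v in values]

print(parse_distribution('{"probabilities": [0.3, 0.7]}'))  # prints [0.3, 0.7]
```

Raising on malformed replies makes it easy to retry a single item instead of discovering the problem during the later analysis.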



Next, we use Few-Shot prompting. We feed the LLM three examples, and then the message. We ask it to provide a soft-label distribution.

In [5]:
def few_shot_prompt(system_prompt, few_shot_examples, user_prompt, prompt):
  # The instructions are sent once, followed by the example pairs and the target message.
  messages = [
    {"role": "system", "content": system_prompt},
    {"role": "system", "content": prompt},
  ]
  for user_message, assistant_message in few_shot_examples:
    messages.append({"role": "user", "content": '\n' + str(user_message) + '\nAnswer:'})
    # json.dumps produces valid JSON (double quotes), matching the requested output format.
    messages.append({"role": "assistant", "content": json.dumps({"probability": assistant_message})})
  messages.append({"role": "user", "content": '\n' + str(user_prompt) + '\nAnswer:'})
  response = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
    response_format={"type": "json_object"},
    stream=False
  )
  return response.choices[0].message.content



# Run dataset on LLM using Zero-shot and Few-shot prompting

In [6]:
def run_prompt_zeroshot(df, prompt):
  # Copy the slice so that adding a column does not raise SettingWithCopyWarning.
  results = df[['lang', 'text', 'soft_label']].copy()
  system_prompt = "You are an assistant that annotates post-reply messages on irony"
  # One API call per row, in row order.
  results['soft_label_prediction'] = [zero_shot_prompt(text, system_prompt, prompt) for text in results['text']]
  return results

def run_prompt_fewshot(df, prompt):
  results = df[['text', 'soft_label']].copy()
  system_prompt = "You are an assistant that annotates post-reply messages on irony"
  # Hand-picked example rows, selected by their ids in the index.
  few_shot_examples = df.loc[['5', '190', '204', '45', '1209', '1585', '2201', '1291', '1415', '1670']][['text', 'soft_label']].copy()
  # fix_output (defined further below) stores each soft label as a list in column 1.
  fix_output(few_shot_examples, 'soft_label')
  few_shot_examples['soft_label'] = few_shot_examples[1]
  few_shot_examples = few_shot_examples.drop(1, axis=1).values.tolist()
  # The example rows stay in the evaluated set, so they receive predictions as well.
  results['soft_label_prediction'] = [few_shot_prompt(system_prompt, few_shot_examples, text, prompt) for text in results['text']]
  return results

**Notes:** First few-shot run: rows 5, 45 and 2201 were used as examples.

**Notes:** Second few-shot run: added 7 new examples for a total of 10 (3 non-ironic, 3 ironic, 4 debatable) and changed the prompt.

**Notes:** Third few-shot run: changed the order in which the examples are fed to the LLM. Previous order:


1.   System prompt
2.   Prompt
3.   1st example
4.   Answer to 1st example
5.   Prompt
6.   2nd example
7.   Answer to 2nd example
8.   ...
9.   Prompt
10.  Post-reply pair to be answered
11.  Answer from LLM

New order:

1.   System prompt
2.   Prompt
3.   1st example
4.   Answer to 1st example
5.   2nd example
6.   Answer to 2nd example
7.   ...
8.   Post-reply pair to be answered
9.   Answer from LLM
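The new order above can be sketched as a pure function that only builds the message list, which makes the ordering easy to inspect without calling the API. The name `build_messages` and the example post-reply pairs are made up for illustration.

```python
import json

def build_messages(system_prompt, prompt, examples, target):
    """Build the chat messages in the new order: instructions once, then examples, then the target."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": prompt},
    ]
    for text, distribution in examples:
        messages.append({"role": "user", "content": f"\n{text}\nAnswer:"})
        messages.append({"role": "assistant", "content": json.dumps({"probability": distribution})})
    # The item to be annotated always comes last.
    messages.append({"role": "user", "content": f"\n{target}\nAnswer:"})
    return messages

# Invented post-reply pairs, not taken from the dataset.
examples = [
    ({"post": "Great weather today.", "reply": "Sure, if you like hail."}, [0.1, 0.9]),
    ({"post": "What time is kickoff?", "reply": "8 pm."}, [1.0, 0.0]),
]
target = {"post": "Nice parking.", "reply": "Truly a work of art."}
messages = build_messages("You annotate irony.", "Return a JSON object.", examples, target)
print([m["role"] for m in messages])
# prints ['system', 'system', 'user', 'assistant', 'user', 'assistant', 'user']
```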




---


**Notes:** The first zero-shot run used prompt_zs as shown above.

**Notes:** The second zero-shot run used a new prompt to improve model performance.


In [None]:
def check_output(df):
  # Rename the dataset's soft label keys to readable names.
  for item in df['soft_label']:
    try:
      item['non_ironic'] = item.pop('0.0')
      item['ironic'] = item.pop('1.0')
    except KeyError:
      pass
  predictions = []
  for item in df['soft_label_prediction']:
    pred_ = json.loads(item)
    try:
      pred_['non_ironic'] = pred_.pop("0")
      pred_['ironic'] = pred_.pop("1")
    except KeyError:
      pass
    # Append every prediction, renamed or not, so the list length matches the DataFrame.
    predictions.append(pred_)
  df['soft_label_prediction'] = predictions
  return df

First we do zero shot prompting on the dataset. Afterwards, we will do few shot prompting on the dataset.

In [None]:
#zeroshot_res = run_prompt_zeroshot(df)
#zeroshot_res = check_output(zeroshot_res)

In [None]:
print(zeroshot_res.head())
# Build the frame in one step so that 'id' and 'soft_pred' stay aligned on the original index.
new_file = pd.DataFrame({
    'id': zeroshot_res.index,
    'soft_pred': zeroshot_res['soft_label_prediction']})
new_file_path = '/content/drive/My Drive/data_practice_phase/data_practice_phase/MP/Results/MP_dev_soft.tsv'
# to_csv writes the file itself; no separate open()/write() call is needed.
new_file.to_csv(new_file_path, sep='\t', header=False, index=False)

NameError: name 'zeroshot_res' is not defined

In [None]:
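# run_prompt_fewshot requires a prompt; the first run is assumed to have used prompt_zs.
fewshot_res = run_prompt_fewshot(df, prompt_zs)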
#fewshot_res = check_output(fewshot_res)



In [None]:
print(fewshot_res.head())
new_file_fs = pd.DataFrame({
    'id': fewshot_res.index,
    'soft_pred' : fewshot_res['soft_label_prediction']})
new_file_fs_path = '/content/drive/My Drive/data_practice_phase/data_practice_phase/MP/Results/MP_dev_softFS.tsv'
new_file_fs.to_csv(new_file_fs_path, sep='\t', header=False, index=False)


                                                text  \
2  {'post': 'Guy is my longtime fb friend. Very m...   
5  {'post': 'Horses is a fucking garbage song, wh...   
6  {'post': '@USER More Life FB!', 'reply': '@USE...   
7  {'post': 'Bring on transfer rumours atleast #L...   
9  {'post': 'Yellow?', 'reply': 'If your pastry i...   

                 soft_label     soft_label_prediction  
2  {'0.0': 0.4, '1.0': 0.6}  {'0.0': 0.7, '1.0': 0.3}  
5    {'0.0': 1.0, '1.0': 0}  {'0.0': 0.2, '1.0': 0.8}  
6    {'0.0': 1.0, '1.0': 0}  {'0.0': 0.9, '1.0': 0.1}  
7  {'0.0': 0.8, '1.0': 0.2}  {'0.0': 0.7, '1.0': 0.3}  
9  {'0.0': 0.4, '1.0': 0.6}  {'0.0': 0.2, '1.0': 0.8}  


In [15]:
import pandas as pd
new_file_path = '/content/drive/My Drive/data_practice_phase/data_practice_phase/MP/Results/MP_dev_softFS.tsv'
fewshot_results = pd.read_csv(new_file_path, sep='\t', header=None)
# Correcting the outputs to list format
test_df = pd.DataFrame([(2,5),({'0.0':0.3, '1.0':0.7}, {'0.0':0.1, '1.0':0.9})]).T
def fix_output(df, colname):
  # Stores each soft label of df[colname] as a two-element list in column 1 of df.
  predictions = []
  for item in df[colname]:
    if colname == 'soft_label':
      # Dataset labels are dicts; take the first two values in key order.
      values = list(item.values())
      predictions.append([values[0], values[1]])
    else:
      # Predictions are JSON strings with a single key that holds the list.
      values = next(iter(json.loads(item).values()))
      itemlist = [values[0], values[1]]
      print(itemlist)
      predictions.append(itemlist)
  df[1] = predictions
# Some predictions use single quotes; convert them so that json.loads accepts them.
for index, item in fs_new['soft_label_prediction'].items():
  if isinstance(item, str):
    fs_new.loc[index, 'soft_label_prediction'] = item.replace("'", '"')
#fix_output(fewshot_results)
#fewshot_results.to_csv(new_file_path, sep='\t', header=False, index=False)
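The conversion above takes the first two values in dict order and repairs single-quoted strings by replacing quote characters. As an alternative sketch, the keys can be looked up by name and `ast.literal_eval` can parse the single-quoted form directly. The helper name `parse_soft_label` is hypothetical, and treating a missing class as probability 0 is an assumption.

```python
import ast

def parse_soft_label(cell):
    """Accept a soft label dict, or its string form, and return [non_ironic, ironic]."""
    # literal_eval handles both "{'0.0': 0.7, ...}" and '{"0.0": 0.7, ...}'.
    label = ast.literal_eval(cell) if isinstance(cell, str) else cell
    # Look keys up by name so the result does not depend on dict ordering;
    # a missing class is assumed to have probability 0.
    return [float(label.get("0.0", 0.0)), float(label.get("1.0", 0.0))]

print(parse_soft_label({"1.0": 0.6, "0.0": 0.4}))     # prints [0.4, 0.6]
print(parse_soft_label("{'0.0': 0.7, '1.0': 0.3}"))  # prints [0.7, 0.3]
```

This only applies to labels keyed by `'0.0'` and `'1.0'`; predictions in the `{"probability": [...]}` format still need the list extracted from their single key.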

# Error analysis

In [None]:
import pandas as pd
#new_file_path_FS = '/content/drive/My Drive/data_practice_phase/data_practice_phase/MP/Results/MP_dev_softFS.tsv'
new_file_path_FS = '/content/drive/My Drive/data_practice_phase/data_practice_phase/MP/Results/MP_dev_softFS2nd.tsv'

fewshot_results = pd.read_csv(new_file_path_FS, sep='\t', header=None)
#new_file_path_ZS = '/content/drive/My Drive/data_practice_phase/data_practice_phase/MP/Results/MP_dev_soft.tsv'
new_file_path_ZS = '/content/drive/My Drive/data_practice_phase/data_practice_phase/MP/Results/MP_dev_soft2nd.tsv'
zeroshot_results = pd.read_csv(new_file_path_ZS, sep='\t', header=None)
file_path = '/content/drive/My Drive/data_practice_phase/data_practice_phase/MP/MP_dev.json'
fix_output(fewshot_results, 1)
fix_output(zeroshot_results, 1)
with open(file_path, 'r') as file:
  data = json.load(file)
df = pd.DataFrame(data)
df = df.transpose()
fix_output(df, 'soft_label')

print(fewshot_results.head())
print(zeroshot_results.head())

In [None]:
print(fewshot_results.head())
print(zeroshot_results.head())
print(df.head())

   0                           1
0  2  [0.7, 0.30000000000000004]
1  5                  [0.3, 0.7]
2  6  [0.9, 0.09999999999999998]
3  7                  [0.3, 0.7]
4  9                  [0.2, 0.8]
   0                           1
0  2                  [0.3, 0.7]
1  5                  [0.2, 0.8]
2  6  [0.9, 0.09999999999999998]
3  7                  [0.2, 0.8]
4  9                  [0.2, 0.8]
   annotation task                                               text  \
2  irony detection  {'post': 'Guy is my longtime fb friend. Very m...   
5  irony detection  {'post': 'Horses is a fucking garbage song, wh...   
6  irony detection  {'post': '@USER More Life FB!', 'reply': '@USE...   
7  irony detection  {'post': 'Bring on transfer rumours atleast #L...   
9  irony detection  {'post': 'Yellow?', 'reply': 'If your pastry i...   

                     annotators number of annotations  \
2  Ann0,Ann20,Ann59,Ann62,Ann63                     5   
5   Ann0,Ann7,Ann13,Ann17,Ann43                    

In [None]:
# Each soft label is a list [non_ironic, ironic]; items are bucketed by the non-ironic value x[0]:
#   non-ironic: x[0] > 0.75
#   ironic:     x[0] < 0.25
#   debatable:  0.25 <= x[0] <= 0.75
zeroshot_nonironic = zeroshot_results[zeroshot_results[1].apply(lambda x: x[0] > 0.75)]
zeroshot_ironic = zeroshot_results[zeroshot_results[1].apply(lambda x: x[0] < 0.25)]
zeroshot_debatable = zeroshot_results[zeroshot_results[1].apply(lambda x: x[0] >= 0.25 and x[0] <= 0.75)]
fewshot_nonironic = fewshot_results[fewshot_results[1].apply(lambda x: x[0] > 0.75)]
fewshot_ironic = fewshot_results[fewshot_results[1].apply(lambda x: x[0] < 0.25)]
fewshot_debatable = fewshot_results[fewshot_results[1].apply(lambda x: x[0] >= 0.25 and x[0] <= 0.75)]
print(df.columns)
df_nonironic = df[df[1].apply(lambda x: x[0] > 0.75)]
df_ironic = df[df[1].apply(lambda x: x[0] < 0.25)]
df_debatable = df[df[1].apply(lambda x: x[0] >= 0.25 and x[0] <= 0.75)]

Index([      'annotation task',                  'text',
                  'annotators', 'number of annotations',
                 'annotations',            'soft_label',
                       'split',                  'lang',
                  'other_info',                       1],
      dtype='object')
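The three-way split described in the comments above can also be written as a single helper, which keeps the thresholds in one place. This is a sketch; the function name `bucket` and the `threshold` parameter are not part of the original notebook.

```python
def bucket(distribution, threshold=0.75):
    """Classify a [non_ironic, ironic] pair as 'non-ironic', 'ironic' or 'debatable'."""
    non_ironic = distribution[0]
    if non_ironic > threshold:
        return "non-ironic"
    if non_ironic < 1 - threshold:
        return "ironic"
    # Everything from 0.25 to 0.75 inclusive (with the default threshold) is debatable.
    return "debatable"

for dist in ([0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.75, 0.25]):
    print(dist, bucket(dist))
```

With such a helper, a column of bucket names could be added once per DataFrame and then grouped on, instead of filtering nine times.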


In [None]:
import statistics

def mean_distribution(frame):
  # Mean [non_ironic, ironic] over the soft labels in column 1, rounded to 2 decimals.
  return [round(statistics.fmean(frame[1].apply(lambda x: x[i])), 2) for i in (0, 1)]

# One entry per bucket, ordered as [zero-shot, few-shot, dataset].
buckets = {'Non-ironic': [zeroshot_nonironic, fewshot_nonironic, df_nonironic],
           'Ironic': [zeroshot_ironic, fewshot_ironic, df_ironic],
           'Debatable': [zeroshot_debatable, fewshot_debatable, df_debatable]}
row_names = ['Zero Shot', 'Few Shot', 'Dataset']
comp_df = pd.DataFrame({name: [mean_distribution(f) for f in frames] for name, frames in buckets.items()},
                       index=row_names)
print(comp_df)
count_df = pd.DataFrame({name: [f.shape[0] for f in frames] for name, frames in buckets.items()},
                        index=row_names)
print(count_df)

             Non-ironic        Ironic     Debatable
Zero Shot  [0.88, 0.12]  [0.19, 0.81]  [0.36, 0.64]
Few Shot   [0.87, 0.13]  [0.19, 0.81]  [0.39, 0.61]
Dataset    [0.93, 0.07]    [0.1, 0.9]  [0.53, 0.47]
           Non-ironic  Ironic  Debatable
Zero Shot         511     931       1563
Few Shot          834     276       1895
Dataset          1453     386       1166


Both prompting approaches place more items in the debatable bucket than the dataset does (1563 for zero-shot and 1895 for few-shot, against 1166), and fewer in the non-ironic bucket. Ideally, both approaches would be more confident on the clearly non-ironic and clearly ironic cases, where the dataset shows high annotator agreement. In the debatable cases the annotators are split almost evenly (0.53 non-ironic, 0.47 ironic), whereas the LLM leans towards the ironic side (0.64 and 0.61 for zero-shot and few-shot respectively).
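The gap between predicted and annotated distributions can also be summarised as one number per run. Below is a minimal sketch using the mean Manhattan (L1) distance, applied to the five gold and predicted rows printed for the first few-shot run. Whether this matches the shared task's official scorer is an assumption, and the function names are hypothetical.

```python
def manhattan(gold, pred):
    """Sum of absolute differences between two distributions of equal length."""
    return sum(abs(g - p) for g, p in zip(gold, pred))

def mean_manhattan(golds, preds):
    """Average Manhattan distance over paired gold and predicted soft labels."""
    distances = [manhattan(g, p) for g, p in zip(golds, preds)]
    return sum(distances) / len(distances)

# [non_ironic, ironic] for the first five rows of the first few-shot run.
golds = [[0.4, 0.6], [1.0, 0.0], [1.0, 0.0], [0.8, 0.2], [0.4, 0.6]]
preds = [[0.7, 0.3], [0.2, 0.8], [0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]
print(round(mean_manhattan(golds, preds), 2))
```

For two-class distributions the distance ranges from 0 (identical) to 2 (opposite certainty), so lower is better.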

As a result of the first zero-shot and few-shot runs, I will adjust the prompt for the zero-shot strategy. I will then use the same updated prompt for few-shot prompting, with a total of 10 examples (3 ironic, 3 non-ironic and 4 debatable).

In [9]:
new_prompt_zs = f"""You will be given a short message containing a post and a reply. Your task is to assess
whether the reply is ironic or non-ironic in context to the post. Irony can be subjective and sometimes
ambiguous. In cases where the tone of the reply is clear, reflect high certainty in your probability distribution.
When the tone of the reply is more ambiguous or debatable, reflect this uncertainty in your probability distribution as well.
Your task:
Return a probability distribution in an output format of a JSON object with one key: "probability".
The value represents a list of two elements. The first element being the value of non_ironic and the second element being ironic.
The values must be between 0 and 1.
The values must sum up to 1.
Example Output:
{{"probability": [0.5, 0.5]}}
Analyze the text and return the output as mentioned above:
"""
#zs_new = run_prompt_zeroshot(df, new_prompt_zs)
new_prompt_fs = f"""You will be given a short message containing a post and a reply. Your task is to assess
whether the reply is ironic or non-ironic in context to the post. Irony can be subjective and sometimes
ambiguous. In cases where the tone of the reply is clear, reflect high certainty in your probability distribution.
When the tone of the reply is more ambiguous or debatable, reflect this uncertainty in your probability distribution as well.

Your task:
Return a probability distribution in an output format of a JSON object with one key: "probability".
The value represents a list of two elements. The first element being the value of non_ironic and the second element being ironic.
The values must be between 0 and 1.
The values must sum up to 1.
Example Output:
{{"probability": [0.5, 0.5]}}
Analyze the text and return the output as mentioned above:
"""
fs_new = run_prompt_fewshot(df, new_prompt_fs)



In [None]:
#print(zs_new.head())
#print(fs_new.head())
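def check_output(df):
  # Replace each JSON string with its [non_ironic, ironic] list.
  predictions = []
  for item in df['soft_label_prediction']:
    pred_ = json.loads(item)
    # The new prompt asks for a single key, "probability", holding the two-element list.
    predictions.append(pred_['probability'])
  df['soft_label_prediction'] = predictions
  return df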
#check_output(zs_new)
#check_output(fs_new)
#print(zs_new.head())
#print(fs_new.head())

In [None]:
new_file_zs = pd.DataFrame(index=zs_new.index)
new_file_zs['id'] = zs_new.index
new_file_zs['soft_pred'] = zs_new['soft_label_prediction']
new_file_zs_path = '/content/drive/My Drive/data_practice_phase/data_practice_phase/MP/Results/MP_dev_soft2nd.tsv'
new_file_zs.to_csv(new_file_zs_path, sep='\t', header=False, index=False)

new_file_fs = pd.DataFrame(index=fs_new.index)
new_file_fs['id'] = fs_new.index
new_file_fs['soft_pred'] = fs_new['soft_label_prediction']
new_file_fs_path = '/content/drive/My Drive/data_practice_phase/data_practice_phase/MP/Results/MP_dev_softFS2nd.tsv'
new_file_fs.to_csv(new_file_fs_path, sep='\t', header=False, index=False)

  id                   soft_pred
2  2                  [0.3, 0.7]
5  5                  [0.2, 0.8]
6  6  [0.9, 0.09999999999999998]
7  7                  [0.2, 0.8]
9  9                  [0.2, 0.8]
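Because the predictions are lists, they come back as plain strings when the TSV is read again. A minimal sketch of a round trip that keeps them parseable is to serialise each list as JSON on write and parse it on read. It uses only the standard library and an in-memory buffer in place of the Drive file; the function names are hypothetical.

```python
import csv
import io
import json

def write_tsv(rows):
    """Serialise (id, distribution) pairs as tab-separated text with JSON-encoded lists."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for item_id, distribution in rows:
        writer.writerow([item_id, json.dumps(distribution)])
    return buffer.getvalue()

def read_tsv(text):
    """Parse the text produced by write_tsv back into (id, distribution) pairs."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [(item_id, json.loads(cell)) for item_id, cell in reader]

rows = [("2", [0.3, 0.7]), ("5", [0.2, 0.8]), ("6", [0.9, 0.1])]
text = write_tsv(rows)
print(text)
print(read_tsv(text) == rows)  # prints True
```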


In [16]:
print(fs_new.head())
fewshot2 = fs_new.copy()
fix_output(fewshot2, 'soft_label_prediction')
print(fs_new.head())
print(fewshot2.head())

                                                text  \
2  {'post': 'Guy is my longtime fb friend. Very m...   
5  {'post': 'Horses is a fucking garbage song, wh...   
6  {'post': '@USER More Life FB!', 'reply': '@USE...   
7  {'post': 'Bring on transfer rumours atleast #L...   
9  {'post': 'Yellow?', 'reply': 'If your pastry i...   

                 soft_label        soft_label_prediction  \
2  {'0.0': 0.4, '1.0': 0.6}  {"probability": [0.8, 0.2]}   
5    {'0.0': 1.0, '1.0': 0}  {"probability": [0.3, 0.7]}   
6    {'0.0': 1.0, '1.0': 0}    {"probability": [1.0, 0]}   
7  {'0.0': 0.8, '1.0': 0.2}  {"probability": [0.5, 0.5]}   
9  {'0.0': 0.4, '1.0': 0.6}  {"probability": [0.3, 0.7]}   

                             1  
2  {"probability": [0.8, 0.2]}  
5  {"probability": [0.3, 0.7]}  
6    {"probability": [1.0, 0]}  
7  {"probability": [0.5, 0.5]}  
9  {"probability": [0.3, 0.7]}  
[0.8, 0.2]
[0.3, 0.7]
[1.0, 0]
[0.5, 0.5]
[0.3, 0.7]
[0.3, 0.7]
[1.0, 0]
[0.8, 0.2]
[1.0, 0]
[0.8, 0.2]


In [17]:
new_file_fs = pd.DataFrame(index=fewshot2.index)
fewshot2['soft_label_prediction'] = fewshot2[1]
new_file_fs['id'] = fewshot2.index
new_file_fs['soft_pred'] = fewshot2['soft_label_prediction']
new_file_fs_path = '/content/drive/My Drive/data_practice_phase/data_practice_phase/MP/Results/MP_dev_softFS4th.tsv'
new_file_fs.to_csv(new_file_fs_path, sep='\t', header=False, index=False)