In [1]:
import datasets
import pandas as pd

# Load the human annotations for the main aspects into a DataFrame.
human_data = datasets.load_dataset(
    'boda/review_evaluation_human_annotation',
    name='combined_main_aspects',
    split='full',
).to_pandas()


In [2]:
aspects = ['actionability', 'grounding_specificity', 'verifiability', 'helpfulness']

hard_examples = {}
for aspect in aspects:
    # Keep only the rows whose label type for this aspect is "hard".
    hard_label_column = f"{aspect}_label_type"
    hard_cases = human_data[human_data[hard_label_column] == "hard"]
    # Fixed seed for a reproducible sample; .copy() avoids pandas'
    # SettingWithCopyWarning on the column assignments below.
    sampled = hard_cases.sample(n=10, random_state=42).copy()

    # Extract the 'labels' entry from each per-aspect annotation record.
    sampled['labels'] = [x['labels'] for x in sampled[aspect]]
    sampled['aspect'] = aspect
    # Drop every aspect-specific column so the four samples share one schema.
    columns_to_drop = [col for col in sampled.columns if any(x in col for x in aspects)]
    hard_examples[aspect] = sampled.drop(columns=columns_to_drop)

all_hard_examples = pd.concat(hard_examples.values(), ignore_index=True)
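
The column-drop step above relies on a substring test against the aspect names. The sketch below replays that test on a plain list of column names to show which ones survive. The helper `split_columns` and the example column list are illustrative assumptions, not part of the notebook, and the real DataFrame may hold other columns.

```python
# The four aspect names from the notebook.
aspects = ['actionability', 'grounding_specificity', 'verifiability', 'helpfulness']


def split_columns(columns, aspects):
    """Return (kept, dropped) using the notebook's substring test."""
    dropped = [col for col in columns if any(a in col for a in aspects)]
    kept = [col for col in columns if col not in dropped]
    return kept, dropped


# Illustrative column list; the real DataFrame may contain other columns.
example_columns = [
    'review_point', 'paper_id',
    'actionability', 'actionability_label_type',
    'chatgpt_verifiability_score',
    'labels', 'aspect', 'human_label',
]

kept, dropped = split_columns(example_columns, aspects)
print(kept)
print(dropped)
```

The generic names (`labels`, `aspect`, `human_label`) contain no aspect name, so they are kept, which is why values are copied into them before the drop.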

In [3]:
all_hard_examples

Unnamed: 0,review_point,paper_id,venue,focused_review,batch,id,labels,aspect
0,"1) ""However, there is no corresponding set of ...",ICLR_2021_1948,ICLR_2021,a. Anonymisation Failure in References\ni. A r...,3,87,"[2, 4, 1]",actionability
1,- There may be no need to distinguish between ...,GHaoCSlhcK,ICLR_2025,1. **Limited discussion of related works** on ...,5,439,"[5, 2, 1]",actionability
2,1. Although the authors derive PAC-Bayesian bo...,pwW807WJ9G,ICLR_2024,1. Although the authors derive PAC-Bayesian bo...,6,681,"[1, 3, 2]",actionability
3,- The authors claim it to be one of the prelim...,EtNebdSBpe,EMNLP_2023,- The paper is hard to read and somewhat diffi...,7,808,"[2, 3, 1]",actionability
4,- confidence in empirical findings: while the ...,1gqR7yEqnP,ICLR_2025,- strong overlap with non-cited work/lack of n...,10,1517,"[2, 5, 4]",actionability
5,* The figures are small and almost unreadable ...,NIPS_2017_262,NIPS_2017,that are addressed below:\n* Most of the theor...,6,641,"[5, 4, 3]",actionability
6,"- The types of situations/social norms (e.g., ...",vg55TCMjbC,EMNLP_2023,- Although the situations are checked by human...,5,364,"[4, 3, 2]",actionability
7,3. **Performance differences between methods a...,WzUPae4WnA,ICLR_2025,1. **The motivation of this paper appears to b...,6,578,"[2, 4, 1]",actionability
8,6. How about the comparison in terms of comput...,NIPS_2020_576,NIPS_2020,1. Although the problem studied in this paper ...,8,1131,"[4, 5, 3]",actionability
9,- The architecture used for the experiments is...,ICLR_2021_2892,ICLR_2021,- Proposition 2 seems to lack an argument why ...,9,1234,"[4, 1, 5]",actionability


In [4]:
chatgpt_errors = {}
for aspect in aspects:
    # Gold and silver results are in separate workbooks, one sheet per aspect.
    df1 = pd.read_excel('../chatgpt/outputs/main_data_batch_gold_results.xlsx', sheet_name=aspect)
    df2 = pd.read_excel('../chatgpt/outputs/main_data_batch_silver_results.xlsx', sheet_name=aspect)
    df = pd.concat([df1, df2], ignore_index=True)

    label_col = f'{aspect}_label'
    score_col = f'chatgpt_{aspect}_score'
    if aspect == 'verifiability':
        # Map the 'X' label to 10 so the columns can be cast to int. The
        # resulting gap between 'X' and a low numeric score is then large.
        df[label_col] = df[label_col].replace('X', 10)
        df[score_col] = df[score_col].replace('X', 10)

    df[label_col] = df[label_col].astype(int)
    df[score_col] = df[score_col].astype(int)

    # Keep the cases where ChatGPT and the human label differ by 3 or more.
    df['diff'] = (df[label_col] - df[score_col]).abs()
    filtered_df = df[df['diff'] >= 3]
    print(f"Aspect: {aspect}, Cases with diff >= 3: {len(filtered_df)}")
    # Fixed seed for a reproducible sample; .copy() avoids pandas'
    # SettingWithCopyWarning on the column assignments below.
    sampled = filtered_df.sample(n=10, random_state=42).copy()

    # Copy the aspect-specific columns to shared names before dropping them.
    sampled['chatgpt_label'] = sampled[score_col]
    sampled['chatgpt_rational'] = sampled[f'chatgpt_{aspect}_rationale']
    sampled['human_label'] = sampled[label_col]
    sampled['aspect'] = aspect

    columns_to_drop = [col for col in sampled.columns if any(x in col for x in aspects)]
    chatgpt_errors[aspect] = sampled.drop(columns=columns_to_drop)

all_chatgpt_errors = pd.concat(chatgpt_errors.values(), ignore_index=True)

Aspect: actionability, Cases with diff >= 3: 146
Aspect: grounding_specificity, Cases with diff >= 3: 54
Aspect: verifiability, Cases with diff >= 3: 231
Aspect: helpfulness, Cases with diff >= 3: 13
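
The counts above depend on how the `'X'` label is handled for verifiability. The following is a minimal pure-Python sketch of the same gap rule, assuming numeric scores fall between 1 and 5 (the sampled labels shown earlier are all in that range). The helper names `to_score` and `is_large_gap` and the example pairs are hypothetical.

```python
X_SENTINEL = 10  # same stand-in value the notebook uses for 'X'


def to_score(value, sentinel=X_SENTINEL):
    """Convert a label to int, mapping 'X' to the sentinel."""
    return sentinel if value == 'X' else int(value)


def is_large_gap(human, model, threshold=3):
    """True when the two labels differ by at least `threshold`."""
    return abs(to_score(human) - to_score(model)) >= threshold


# Made-up (human, model) pairs, for illustration only.
pairs = [(5, 2), (4, 3), ('X', 5), ('X', 'X'), (1, 'X')]
flags = [is_large_gap(h, m) for h, m in pairs]
print(flags)
```

Under that assumption, every pair of `'X'` with a numeric score is flagged while `'X'` against `'X'` is not, which may contribute to the larger verifiability count.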


In [9]:
for x in all_chatgpt_errors['chatgpt_rational']:
    print(x)

The review point raises a question about the meaning of "100 steps" in the context of search models comparison, specifically asking if it refers to "100 sampled strategies." While the question implies that clarification is needed, it does not explicitly instruct the authors to provide this clarification in the text. The action is implicit and somewhat vague, as the authors can infer that they need to clarify this point but may not be entirely sure how to incorporate it into their draft. Therefore, the comment is barely actionable.
The review point raises a question about the prevalence of discourse relations in Table A2, asking whether this is due to colloquial language or a different use of "discourse" compared to other languages in Universal Dependencies (UD). While the comment highlights a potential issue, it does not provide explicit guidance or suggest specific actions for the authors to take. The authors might infer that they need to clarify or justify their use of discourse rela

In [6]:
with pd.ExcelWriter('outputs/error_analysis.xlsx') as writer:
    all_chatgpt_errors.to_excel(writer, sheet_name='ChatGPT Errors', index=False)
    all_hard_examples.to_excel(writer, sheet_name='Hard Examples', index=False)