Notebook analyzing the inter-annotator agreement (IAA) among the three annotators on the binary relevance task.

In [9]:
import os

import numpy as np
import pandas as pd
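import statsmodels as sm
# Import the submodule explicitly: a bare `import statsmodels` does not load `sm.stats.inter_rater`.
import statsmodels.stats.inter_rater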

In [2]:
labeled_annotation_files = [x for x in os.listdir('annotation_files/labeled/') if 'sample' in x]
print(labeled_annotation_files)

['sample_for_relevance_annotation_YY.csv', 'alyssa_annotations - sample_for_relevance_annotation.tsv', 'sample_for_relevance_annotation_labeled_SK.csv']


In [3]:
annotation_df1 = pd.read_csv('annotation_files/labeled/sample_for_relevance_annotation_YY.csv')
print(annotation_df1.info())
annotation_df1.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   url              100 non-null    object
 1   title            100 non-null    object
 2   subtitle         76 non-null     object
 3   text             100 non-null    object
 4   Relevance_Label  100 non-null    int64 
dtypes: int64(1), object(4)
memory usage: 4.0+ KB
None


Unnamed: 0,url,title,subtitle,text,Relevance_Label
0,https://www.cleveland.com/reckon/2023/06/archi...,Archie Comics is ready to introduce its first ...,,People are making change and breaking down bar...,1
1,https://www.wkbn.com/sports/transgender-athlet...,Transgender athlete ban bill moves forward at ...,Lawmakers at the Ohio Statehouse voted on Wedn...,Watch a previous NBC4 report on House Bill 6 i...,1
2,https://kesq.com/news/2023/04/29/the-us-has-a-...,The US has a rich drag history. Here’s why the...,,"Scottie Andrew, CNN\n\nTo many, the stereotypi...",1
3,https://www.foxnews.com/media/teacher-calls-8t...,UK teacher calls 8th-grader 'despicable' for s...,A U.K. teacher at Rye College in East Sussex c...,A U.K. teacher got into a heated argument with...,1
4,https://www.nbcbayarea.com/news/national-inter...,Target Makes Changes to LGBTQ Merchandise for ...,Target is removing certain items from its stor...,Target is removing certain items from its stor...,1


In [4]:
annotation_df2 = pd.read_csv('annotation_files/labeled/sample_for_relevance_annotation_labeled_SK.csv')
print(annotation_df2.info())
annotation_df2.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   url              100 non-null    object
 1   title            100 non-null    object
 2   subtitle         76 non-null     object
 3   text             100 non-null    object
 4   Relevance_Label  100 non-null    int64 
dtypes: int64(1), object(4)
memory usage: 4.0+ KB
None


Unnamed: 0,url,title,subtitle,text,Relevance_Label
0,https://www.cleveland.com/reckon/2023/06/archi...,Archie Comics is ready to introduce its first ...,,People are making change and breaking down bar...,1
1,https://www.wkbn.com/sports/transgender-athlet...,Transgender athlete ban bill moves forward at ...,Lawmakers at the Ohio Statehouse voted on Wedn...,Watch a previous NBC4 report on House Bill 6 i...,1
2,https://kesq.com/news/2023/04/29/the-us-has-a-...,The US has a rich drag history. Here’s why the...,,"Scottie Andrew, CNN\n\nTo many, the stereotypi...",1
3,https://www.foxnews.com/media/teacher-calls-8t...,UK teacher calls 8th-grader 'despicable' for s...,A U.K. teacher at Rye College in East Sussex c...,A U.K. teacher got into a heated argument with...,1
4,https://www.nbcbayarea.com/news/national-inter...,Target Makes Changes to LGBTQ Merchandise for ...,Target is removing certain items from its stor...,Target is removing certain items from its stor...,1


In [5]:
annotation_df3 = pd.read_csv('annotation_files/labeled/alyssa_annotations - sample_for_relevance_annotation.tsv', sep='\t')
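# Manually set the label for row 3 before casting the column to int.
annotation_df3.iloc[3, annotation_df3.columns.get_loc('Relevance_Label')] = 1.0
# Drop the extra unnamed column that came with the TSV export.
annotation_df3 = annotation_df3.drop('Unnamed: 5', axis=1)
annotation_df3['Relevance_Label'] = annotation_df3['Relevance_Label'].astype(int)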
print(annotation_df3.info())
annotation_df3.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   url              100 non-null    object
 1   title            100 non-null    object
 2   subtitle         76 non-null     object
 3   text             100 non-null    object
 4   Relevance_Label  100 non-null    int64 
dtypes: int64(1), object(4)
memory usage: 4.0+ KB
None


Unnamed: 0,url,title,subtitle,text,Relevance_Label
0,https://www.cleveland.com/reckon/2023/06/archi...,Archie Comics is ready to introduce its first ...,,People are making change and breaking down bar...,1
1,https://www.wkbn.com/sports/transgender-athlet...,Transgender athlete ban bill moves forward at ...,Lawmakers at the Ohio Statehouse voted on Wedn...,Watch a previous NBC4 report on House Bill 6 i...,1
2,https://kesq.com/news/2023/04/29/the-us-has-a-...,The US has a rich drag history. Here’s why the...,,"Scottie Andrew, CNN To many, the stereotypica...",1
3,https://www.foxnews.com/media/teacher-calls-8t...,UK teacher calls 8th-grader 'despicable' for s...,A U.K. teacher at Rye College in East Sussex c...,A U.K. teacher got into a heated argument with...,1
4,https://www.nbcbayarea.com/news/national-inter...,Target Makes Changes to LGBTQ Merchandise for ...,Target is removing certain items from its stor...,Target is removing certain items from its stor...,1


In [6]:
assert list(annotation_df1['url']) == list(annotation_df2['url']) == list(annotation_df3['url'])

In [12]:
raters_data = np.array([list(annotation_df1['Relevance_Label']),
                        list(annotation_df2['Relevance_Label']), 
                        list(annotation_df3['Relevance_Label'])]).T
print(raters_data.shape)

(100, 3)


In [19]:
inter_rater_stats_table = sm.stats.inter_rater.aggregate_raters(raters_data,
                                                                n_cat=2)
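
For reference, the aggregation step turns the (items x raters) label matrix into an (items x categories) count table, which is the input format Fleiss' kappa expects. The plain-Python sketch below illustrates the idea on made-up data; `aggregate_counts` and `toy_ratings` are hypothetical names, labels are assumed to be integers in `0..n_cat-1`, and this is not the statsmodels implementation.

```python
from collections import Counter

def aggregate_counts(ratings, n_cat):
    """Convert per-rater labels (items x raters) into per-category counts (items x categories)."""
    table = []
    for row in ratings:
        counts = Counter(row)
        table.append([counts.get(cat, 0) for cat in range(n_cat)])
    return table

# Hypothetical toy labels for four items and three raters (not the real annotations).
toy_ratings = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 1],
]
toy_table = aggregate_counts(toy_ratings, n_cat=2)
print(toy_table)  # [[0, 3], [3, 0], [1, 2], [2, 1]]
```

Each row of the result sums to the number of raters; column 0 counts votes for label 0 and column 1 counts votes for label 1.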

In [25]:
print('Fleiss Kappa value = ' + str(sm.stats.inter_rater.fleiss_kappa(inter_rater_stats_table[0], method='fleiss')))

Fleiss Kappa value = 0.6072013093289687
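
For readers who want the definition behind this number, the standard form of Fleiss' kappa is given below. The symbols are introduced here for illustration and do not appear elsewhere in the notebook: $n_{ij}$ is the number of raters assigning item $i$ to category $j$, $N$ is the number of items, $n$ the number of raters per item, and $k$ the number of categories.

$$
p_j = \frac{1}{N n} \sum_{i=1}^{N} n_{ij}, \qquad
P_i = \frac{1}{n(n-1)} \left( \sum_{j=1}^{k} n_{ij}^2 - n \right)
$$

$$
\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i, \qquad
\bar{P}_e = \sum_{j=1}^{k} p_j^2, \qquad
\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}
$$

Here $\bar{P}$ is the mean observed pairwise agreement per item and $\bar{P}_e$ is the agreement expected by chance given the overall category proportions.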


In [26]:
agg_rating_counts = inter_rater_stats_table[0]
print(agg_rating_counts.shape)

(100, 2)


In [30]:
num_perfect_agreement = 0
num_majority_say_irrelev = 0
num_majority_say_relev = 0
for i in range(agg_rating_counts.shape[0]):
    if 3 in agg_rating_counts[i]:
        num_perfect_agreement += 1
    else:
        if agg_rating_counts[i][0] == 2:
            num_majority_say_irrelev += 1
        elif agg_rating_counts[i][1] == 2:
            num_majority_say_relev += 1
print(num_perfect_agreement)
print(num_majority_say_irrelev)
print(num_majority_say_relev)

80
6
14


In [31]:
# Within the perfect-agreement samples, count each consensus label.
num_relev_consensus = 0
num_irrelev_consensus = 0
for i in range(agg_rating_counts.shape[0]):
    if 3 in agg_rating_counts[i]:
        if agg_rating_counts[i][0] == 3:
            num_irrelev_consensus += 1
        elif agg_rating_counts[i][1] == 3:
            num_relev_consensus += 1
print(num_relev_consensus)
print(num_irrelev_consensus)

67
13


## Report on inter-annotator agreement analysis of internal relevance-classification task:

**The task:**

number of samples = 100

number of annotators = 3

number of categories = 2 (relevant or irrelevant)


**Findings:**

Fleiss' kappa = 0.607 (right at the threshold of "substantial" agreement, 0.61 to 0.80, on the commonly used Landis and Koch scale)

#Samples with perfect agreement or consensus (all 3 annotators choosing the same category) = 80

Within samples without consensus (20/100): 

    #Samples with 2/3 annotators choosing irrelevant: 6
    
    #Samples with 2/3 annotators choosing relevant: 14

Within samples with consensus (80/100):

    #Samples labeled irrelevant: 13
    
    #Samples labeled relevant: 67
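
As a sanity check, the reported kappa can be recomputed from the four vote patterns listed above, without access to the raw annotations. The sketch below rebuilds the count table and applies the standard Fleiss formula in plain Python. The function name `fleiss_kappa_from_counts` is hypothetical, and the column order (irrelevant first, relevant second) is an assumption based on the variable names used in the notebook.

```python
# Vote patterns from the findings: (number of items, [n_irrelevant, n_relevant]).
patterns = [
    (67, [0, 3]),  # all three annotators say relevant
    (13, [3, 0]),  # all three annotators say irrelevant
    (6, [2, 1]),   # majority says irrelevant
    (14, [1, 2]),  # majority says relevant
]
table = [row for count, row in patterns for _ in range(count)]

def fleiss_kappa_from_counts(table):
    """Fleiss' kappa for an (items x categories) table of vote counts."""
    n_items = len(table)
    n_raters = sum(table[0])
    n_cat = len(table[0])
    # Overall proportion of votes falling in each category.
    p_cat = [sum(row[j] for row in table) / (n_items * n_raters) for j in range(n_cat)]
    # Observed pairwise agreement for each item.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1)) for row in table]
    p_bar = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)

kappa = fleiss_kappa_from_counts(table)
print(round(kappa, 4))  # 0.6072
```

The mean observed agreement is about 0.867 and the chance agreement is about 0.661, the latter driven by the skew toward the relevant class (235 of 300 votes).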


### Proposed paths forward

1. We use only the 80 samples with full consensus and do a 50-50 train-test split (40 samples for training the few-shot classifier, 40 held out for testing and reporting). We report the agreement statistics for the full 100-sample set in an appendix and concede that relevance annotation is a somewhat subjective task.
2. Same as Path 1, but we use all 100 labeled samples for the train-test split; for the 20 samples with disagreement, we take the majority label as the label.
3. We discuss the 20 samples without consensus to understand the disagreement, do some consensus-building, and then annotate a new, different 100-sample set. Multiple rounds of annotation and discussion are a popular strategy for building consensus: we would report agreement statistics on that later iteration and use those labeled samples instead. Agreement or the consensus ratio may come out higher, which would suggest that the relevance-annotation task is more objective than this first round indicates.

Selected path after discussion with team: **Path 2**
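
A minimal sketch of how Path 2 could be carried out: take the majority vote per sample, then split the labeled samples in half. The data here is made up, and `majority_label`, `split_half`, and the fixed seed are hypothetical choices; the source does not say how the split is randomized or whether it is stratified by label.

```python
import random

def majority_label(votes):
    """Return the label chosen by most annotators (binary labels, odd number of votes)."""
    return 1 if sum(votes) * 2 > len(votes) else 0

def split_half(items, seed=0):
    """Shuffle with a fixed seed and split into two equal halves (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Hypothetical toy votes from three annotators for six samples (not the real annotations).
toy_votes = [[1, 1, 1], [0, 0, 0], [1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 1, 0]]
labels = [majority_label(v) for v in toy_votes]
print(labels)  # [1, 0, 1, 0, 1, 0]

# Pair each label with its sample index so the split can be traced back to the rows.
train, test = split_half(list(enumerate(labels)), seed=0)
print(len(train), len(test))  # 3 3
```

With three annotators and two categories a tie is impossible, so the majority label is always defined.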