## Selecting `example_ids` to enable Interrater Reliability Checking
This notebook is run after all data from a batch has been labelled. It identifies `example_ids` that have been labelled by only one rater so that they can be randomly selected for inclusion in the subsequent batch for interrater reliability checking.

We want to prioritize examples that contain toxicity and ensure that each rater receives an equal number of examples from every other rater.

To illustrate, suppose each rater's batch contains 60 examples that have previously been rated once:
- Rater A will receive 20 examples each from raters B, C, and D.
- Of each set of 20 examples, more than half contain at least one toxic attribute.
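
The arithmetic behind this illustration can be sketched as follows. The figures (four raters, 60 previously rated examples per batch) come from the example above; `min_toxic_per_source` is a hypothetical name for the smallest count that satisfies "more than half".

```python
n_raters = 4                # raters A, B, C, D
interrater_per_batch = 60   # previously rated examples in one rater's batch

n_other_raters = n_raters - 1
# Examples received from each of the other raters
per_source = interrater_per_batch // n_other_raters
# Smallest number of toxic examples that is "more than half" of per_source
min_toxic_per_source = per_source // 2 + 1

print(per_source, min_toxic_per_source)
```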

The resulting pickle file will contain the randomly assigned `example_ids` for each rater and be read in by the next batch assignment notebook.

In [102]:
import os
import pandas as pd
import pickle
import random
import datetime as dt

In [103]:
repo_dir = "/Users/ameliachu/repos/nlu-reddit-toxicity-dataset"
labelled_data_dir = f"{repo_dir}/data/labelled/"

Gathering file names of all labelled data

In [104]:
labelled_data_fnames = os.listdir(labelled_data_dir)
labelled_data_fnames

['yj2369_labelling_assignment_2022-04-13.csv',
 'gm2858_labelling_assignment_2022-04-21.csv',
 'gm2858_labelling_assignment_2022-04-09.csv',
 'yp2201_labelling_assignment_2022-04-09.csv',
 'yp2201_labelling_assignment_2022-04-21.csv',
 'ac4119_labelling_assignment_2022-04-09.csv',
 'ac4119_labelling_assignment_2022-04-21.csv',
 'ac4119_labelling_assignment_2022-04-13.csv',
 'yp2201_labelling_assignment_2022-04-13.csv',
 'gm2858_labelling_assignment_2022-04-13.csv',
 'yj2369_labelling_assignment_2022-04-09.csv',
 'yj2369_labelling_assignment_2022-04-21.csv']

### Placing all labelled data to date in a DataFrame and appending the required metadata

In [105]:
labels = ['toxicity', 'severe_toxicity', 'identity_attack', 'insult', 'profanity', 'threat']
list_of_labelled_examples = []

for fname in labelled_data_fnames:

    labelled_data = pd.read_csv(f"{labelled_data_dir}{fname}")
    # Creating a summary column 'has_toxicity' to flag examples
    # which contain at least one toxic attribute (1 = yes, 0 = no)
    labelled_data['has_toxicity'] = (labelled_data[labels].sum(axis=1) > 0).astype(int)
    labelled_data['rater_id'] = fname.split("_")[0]
    labelled_data['assignment_date'] = fname.split("_")[-1][:-4]
    example_id_df = labelled_data[['assignment_date','example_id','has_toxicity','rater_id']]
    list_of_labelled_examples.append(example_id_df)
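
The `rater_id` and `assignment_date` values above are parsed from the file name. A minimal sketch of that parsing, assuming every file follows the `<rater_id>_labelling_assignment_<date>.csv` pattern seen in the listing; `parse_assignment_fname` is a hypothetical helper, not part of the original notebook.

```python
import os

def parse_assignment_fname(fname: str) -> tuple:
    """Split '<rater_id>_labelling_assignment_<date>.csv' into (rater_id, date)."""
    stem, _ext = os.path.splitext(fname)  # drop the extension, whatever its length
    parts = stem.split("_")
    return parts[0], parts[-1]

parsed = parse_assignment_fname("yj2369_labelling_assignment_2022-04-13.csv")
print(parsed)
```

Using `os.path.splitext` rather than slicing off the last four characters keeps the date intact if the extension ever changes length.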

In [106]:
labelled_examples_lookup = pd.concat(list_of_labelled_examples)

In [107]:
len(set(labelled_examples_lookup[labelled_examples_lookup['has_toxicity']==1]['example_id'].values))

368

Creating `n_raters_lookup` to determine whether each example has already had an interrater check

In [108]:
n_raters_lookup = labelled_examples_lookup.groupby("example_id").rater_id.nunique().reset_index()

In [109]:
n_raters_lookup.columns = ['example_id', 'n_raters']

In [110]:
labelled_examples_lookup_2 = pd.merge(labelled_examples_lookup, n_raters_lookup, on=['example_id'])
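
To make the lookup concrete, here is a small sketch on toy data. The example IDs and rater names are hypothetical; the steps mirror the group-by, count, and merge shown above.

```python
import pandas as pd

# Toy data (hypothetical IDs): e1 was labelled by two raters, e2 and e3 by one each.
toy = pd.DataFrame({
    "example_id": ["e1", "e1", "e2", "e3"],
    "rater_id":   ["A",  "B",  "A",  "C"],
})

# Count distinct raters per example, then attach the count to every row.
toy_n_raters = (
    toy.groupby("example_id")["rater_id"]
       .nunique()
       .reset_index(name="n_raters")
)
toy_lookup = toy.merge(toy_n_raters, on="example_id")

# Examples seen by a single rater still need an interrater check.
needs_check = sorted(toy_lookup.loc[toy_lookup["n_raters"] <= 1, "example_id"])
print(needs_check)
```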

Creating `selected_examples`, a dictionary that contains n randomly-selected `example_ids` of each `rater_id` x `has_toxicity` combination, based on the pre-defined `sample_ratio`.

`selected_examples` will be used to assign each rater examples for interrater reliability checking.

In [111]:
# creating pandas masks to simplify filtering
toxic_examples = labelled_examples_lookup_2['has_toxicity'] == 1
nontoxic_examples = labelled_examples_lookup_2['has_toxicity'] == 0
need_interrater = labelled_examples_lookup_2['n_raters'] <= 1

In [112]:
# Defining the raters
rater_ids = ['ac4119', 'gm2858', 'yj2369','yp2201']
n_raters = len(rater_ids)
n_other_raters = n_raters - 1

In [113]:
# Defining how many examples without and with toxic attributes each rater receives from each other rater.
sample_ratio = {
    'nontoxic': 30,
    'toxic': 10
}

In [120]:
labelled_examples_lookup_2[labelled_examples_lookup_2['has_toxicity']==1].groupby(['assignment_date','rater_id']).count()

assignment_date,rater_id,example_id,has_toxicity,n_raters
2022-04-09,ac4119,51,51,51
2022-04-09,gm2858,30,30,30
2022-04-09,yj2369,30,30,30
2022-04-09,yp2201,28,28,28
2022-04-13,ac4119,48,48,48
2022-04-13,gm2858,24,24,24
2022-04-13,yj2369,31,31,31
2022-04-13,yp2201,16,16,16
2022-04-21,ac4119,61,61,61
2022-04-21,gm2858,55,55,55


In [114]:
toxic_df = labelled_examples_lookup_2[toxic_examples & need_interrater]
nontoxic_df = labelled_examples_lookup_2[nontoxic_examples & need_interrater]
selected_examples = {}

for rater in rater_ids:
    toxic = toxic_df[toxic_df['rater_id'] == rater]['example_id']
    nontoxic = nontoxic_df[nontoxic_df['rater_id'] == rater]['example_id']
    print(rater, len(toxic),len(nontoxic))
    
    toxic = toxic.sample(sample_ratio['toxic'] * n_other_raters)
    nontoxic = nontoxic.sample(sample_ratio['nontoxic'] * n_other_raters)
    
    selected_examples[rater] = {
        'toxic': list(toxic.values),
        'nontoxic':  list(nontoxic.values),
    }

ac4119 85 276
gm2858 31 328
yj2369 43 315
yp2201 30 329
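
The counts above show that one rater has exactly the 30 toxic examples required (10 for each of 3 other raters), so there is no margin. `Series.sample` raises a `ValueError` when asked for more items than the pool holds, so a guard with a clearer message may help. The sketch below is an assumption rather than part of the original workflow: `sample_or_fail` and its `seed` argument are hypothetical, and `random_state` is used only to make the draw reproducible.

```python
import pandas as pd

def sample_or_fail(ids: pd.Series, n: int, seed: int = 0) -> list:
    """Draw n ids without replacement, with an explicit error if the pool is too small."""
    if len(ids) < n:
        raise ValueError(f"need {n} examples but only {len(ids)} are available")
    return ids.sample(n, random_state=seed).tolist()

# Toy pool of 31 hypothetical example ids
pool = pd.Series([f"id_{i}" for i in range(31)])
picked = sample_or_fail(pool, 30)
print(len(picked))
```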


### Determining the batch indices for "toxic" and "nontoxic" examples
This calculates the `(start, end)` slice boundaries used to split each rater's selected examples of each type into one equally sized chunk per other rater.

In [92]:
n_batches = (n_raters - 1) * 2 # 2 = has_toxicity TRUE, FALSE

In [93]:
batch_indices = {
    "toxic": [],
    "nontoxic": []
}

In [94]:
toxic_start_id = 0
nontoxic_start_id = 0

for i in range(n_other_raters):
    toxic_end_id = toxic_start_id + sample_ratio['toxic']
    nontoxic_end_id = nontoxic_start_id + sample_ratio['nontoxic']
    batch_indices['toxic'].append((toxic_start_id, toxic_end_id))
    batch_indices['nontoxic'].append((nontoxic_start_id, nontoxic_end_id))
    toxic_start_id = toxic_end_id
    nontoxic_start_id = nontoxic_end_id

In [95]:
batch_indices

{'toxic': [(0, 10), (10, 20), (20, 30)],
 'nontoxic': [(0, 30), (30, 60), (60, 90)]}
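
The same boundaries can be derived in closed form, which may serve as a quick cross-check. In this sketch, `chunk_bounds` is a hypothetical helper; the sizes and the number of other raters are copied from the cells above.

```python
def chunk_bounds(chunk_size: int, n_chunks: int) -> list:
    """Return (start, end) slice bounds for n_chunks consecutive chunks."""
    return [(i * chunk_size, (i + 1) * chunk_size) for i in range(n_chunks)]

sample_ratio = {"toxic": 10, "nontoxic": 30}
n_other_raters = 3

batch_indices_check = {
    example_type: chunk_bounds(size, n_other_raters)
    for example_type, size in sample_ratio.items()
}
print(batch_indices_check)
```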

### Determining batch assignment for each rater

Each rater will receive an equal number of examples from every other rater, at the same toxic vs. non-toxic ratio. This step determines how each labelled `rater_id` x `has_toxicity` subset is split and randomly assigned to each `other_rater`, then applies the split and stores the result in `interrater_assignment`.
```
interrater_assignment = Dict(rater_id,list_of_example_ids)
```

In [96]:
interrater_assignment = {r:[] for r in rater_ids}

for rater in rater_ids:
    other_raters = [r for r in rater_ids if r!=rater]
    for example_type in ['toxic','nontoxic']:
        example_list = selected_examples[rater][example_type]
        sample_ranges = batch_indices[example_type]
        # Shuffling so that the slice each other rater receives is random
        random.shuffle(other_raters)
        assign_batches_to_raters = list(zip(other_raters, sample_ranges))
        print(f'{example_type} assignment order:', assign_batches_to_raters)
        for other_rater, (start_id, end_id) in assign_batches_to_raters:
            interrater_assignment[other_rater] += example_list[start_id:end_id]

toxic assignment order: [('gm2858', (0, 10)), ('yj2369', (10, 20)), ('yp2201', (20, 30))]
nontoxic assignment order: [('yp2201', (0, 30)), ('gm2858', (30, 60)), ('yj2369', (60, 90))]
toxic assignment order: [('yp2201', (0, 10)), ('yj2369', (10, 20)), ('ac4119', (20, 30))]
nontoxic assignment order: [('yp2201', (0, 30)), ('yj2369', (30, 60)), ('ac4119', (60, 90))]
toxic assignment order: [('yp2201', (0, 10)), ('gm2858', (10, 20)), ('ac4119', (20, 30))]
nontoxic assignment order: [('yp2201', (0, 30)), ('gm2858', (30, 60)), ('ac4119', (60, 90))]
toxic assignment order: [('gm2858', (0, 10)), ('yj2369', (10, 20)), ('ac4119', (20, 30))]
nontoxic assignment order: [('gm2858', (0, 30)), ('ac4119', (30, 60)), ('yj2369', (60, 90))]
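
A sanity check on the result could look like the sketch below. It rebuilds the assignment on synthetic IDs (not the real data) so that it runs on its own, then verifies three properties: every rater receives the same number of examples, no example is assigned twice, and no rater is handed their own examples. The names `toy_selected` and `toy_assignment` are hypothetical stand-ins.

```python
import random

rater_ids = ["ac4119", "gm2858", "yj2369", "yp2201"]
sizes = {"toxic": 10, "nontoxic": 30}
n_other = len(rater_ids) - 1

# Synthetic stand-in for `selected_examples`: each id encodes its source rater and type.
toy_selected = {
    r: {t: [f"{r}-{t}-{i}" for i in range(n * n_other)] for t, n in sizes.items()}
    for r in rater_ids
}

toy_assignment = {r: [] for r in rater_ids}
for rater in rater_ids:
    others = [r for r in rater_ids if r != rater]
    for example_type, size in sizes.items():
        random.shuffle(others)
        for k, other in enumerate(others):
            chunk = toy_selected[rater][example_type][k * size:(k + 1) * size]
            toy_assignment[other] += chunk

expected = n_other * sum(sizes.values())  # 3 * (10 + 30)
for r, ids in toy_assignment.items():
    assert len(ids) == expected                    # equal workload
    assert len(set(ids)) == len(ids)               # no duplicates
    assert not any(i.startswith(r) for i in ids)   # no self-assigned examples
print(expected)
```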


Saving assignment...

In [97]:
interrater_assignment_path = f"{repo_dir}/data/interrater-reliability/interrater_assignment_{dt.date.today()}.p"
print(interrater_assignment_path)
# Using a context manager so the file handle is closed once the pickle is written
with open(interrater_assignment_path, "wb") as f:
    pickle.dump(interrater_assignment, f)

/Users/ameliachu/repos/nlu-reddit-toxicity-dataset/data/interrater-reliability/interrater_assignment_2022-04-28.p
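
For reference, reading the assignment back in the next notebook might look like the round trip below. This is a sketch under assumptions: it writes a toy dictionary with hypothetical IDs to a temporary directory instead of touching the real file, and the file name is illustrative only.

```python
import os
import pickle
import tempfile

# Toy assignment with hypothetical rater and example ids
toy_assignment = {"rater_a": ["id_1", "id_2"], "rater_b": ["id_3"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "interrater_assignment_example.p")
    with open(path, "wb") as f:
        pickle.dump(toy_assignment, f)
    # The consuming notebook would open the saved path in binary read mode
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(loaded == toy_assignment)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.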
