## Batch 3 (2022-04-21)
An issue was detected where the initial batch appeared to use a different version of `randomized_example_ids.p`. This batch (and part of the subsequent batch) will rectify this by prioritizing the example_ids that should have already been assigned.

In [37]:
import os
import random
import datetime as dt
import pickle
import pandas as pd

In [52]:
repo_dir = "/Users/ameliachu/repos/nlu-reddit-toxicity-dataset"
labelled_data_dir = f"{repo_dir}/data/labelled/"

# Collecting all the labelled file names 
labelled_data_fnames = [f for f in os.listdir(labelled_data_dir)]

# Pre-randomized list of example_ids
randomized_example_ids_path = f"{repo_dir}/data/randomized_example_ids.p"

# Dictionary of example_ids that have been labelled once
# See assign-examples-for-interrater.ipynb for more details
interrater_assignment_path = f"{repo_dir}/data/interrater-reliability/interrater_assignment_2022-04-21.p"

print(labelled_data_fnames)
print(randomized_example_ids_path)

['yj2369_labelling_assignment_2022-04-13.csv', 'gm2858_labelling_assignment_2022-04-09.csv', 'yp2201_labelling_assignment_2022-04-09.csv', 'ac4119_labelling_assignment_2022-04-09.csv', 'ac4119_labelling_assignment_2022-04-13.csv', 'yp2201_labelling_assignment_2022-04-13.csv', 'gm2858_labelling_assignment_2022-04-13.csv', 'yj2369_labelling_assignment_2022-04-09.csv']
/Users/ameliachu/repos/nlu-reddit-toxicity-dataset/data/randomized_example_ids.p


Reading pickles to inform example assignment

In [53]:
# Context managers close the file handles once the pickles are loaded
with open(randomized_example_ids_path, "rb") as f:
    example_indices = pickle.load(f)
with open(interrater_assignment_path, "rb") as f:
    interrater_assignment = pickle.load(f)

### Determining the example_ids that should have been labelled

Obtaining a full list of all the example_ids that have currently been labelled.

In [4]:
selected_columns = ['example_id']
list_of_example_ids = [pd.read_csv(f"{labelled_data_dir}{fname}")[selected_columns] for fname in labelled_data_fnames]

In [11]:
list_of_example_ids.append(pd.read_csv(f"{repo_dir}/data/to_label/yp2201_labelling_assignment_2022-04-13")[selected_columns])

In [12]:
example_ids_pd = pd.concat(list_of_example_ids)

In [59]:
len(example_ids_pd), example_ids_pd['example_id'].nunique()

(1361, 1358)
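
The counts above show 1361 rows but only 1358 distinct IDs, so a few example_ids appear more than once across the files. A minimal sketch of how the repeated IDs could be listed, using made-up IDs and a hypothetical helper name (`find_duplicate_ids` is not part of the original notebook):

```python
import pandas as pd

def find_duplicate_ids(ids_df, column="example_id"):
    """Return the sorted unique values in `column` that occur more than once."""
    # keep=False marks every occurrence of a repeated value, not just the later ones
    dup_mask = ids_df[column].duplicated(keep=False)
    return sorted(ids_df.loc[dup_mask, column].unique().tolist())

# Toy data standing in for the concatenated labelled files
toy_ids = pd.DataFrame({"example_id": [10, 11, 12, 11, 13, 10]})
duplicates = find_duplicate_ids(toy_ids)
print(duplicates)
```

Running the same helper on `example_ids_pd` would show which IDs were assigned to more than one file.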

In [16]:
ids_that_should_have_labels = example_indices[:1360]

In [19]:
to_label_immediately = set(ids_that_should_have_labels) - set(example_ids_pd.example_id.values)

In [21]:
len(to_label_immediately)

597

Grabbing the position of each missing example_id in the randomized list, so that the missing example_ids can be backfilled in their original randomized order.

In [30]:
indices_to_replace = [example_indices.index(i) for i in to_label_immediately]
# sorted() makes the order explicit; a set has no guaranteed iteration order
ordered_indices = sorted(set(indices_to_replace))
missing_example_ids = [example_indices[i] for i in ordered_indices]

In [31]:
print(missing_example_ids[:5])

[375030, 200577, 329502, 511720, 52314]
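
Since the goal is to keep the missing IDs in their pre-randomized order, an alternative is to filter the randomized list directly against a set of already-labelled IDs. This is only a sketch on toy data: `missing_in_order`, `randomized_ids`, and `labelled_ids` are illustrative names, and it assumes the randomized list contains no repeated IDs.

```python
def missing_in_order(randomized_ids, labelled_ids, limit):
    """Return IDs among the first `limit` randomized IDs that are not yet labelled,
    preserving their order in the randomized list."""
    labelled = set(labelled_ids)  # set membership checks are O(1)
    return [i for i in randomized_ids[:limit] if i not in labelled]

# Toy data standing in for example_indices and the labelled example_ids
randomized_ids = [42, 7, 19, 3, 88, 61]
labelled_ids = [7, 3, 61]
missing = missing_in_order(randomized_ids, labelled_ids, limit=5)
print(missing)  # [42, 19, 88]
```

This avoids the repeated `list.index` lookups and never depends on set iteration order.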


Defining the raters who need to be assigned a batch.

In [None]:
rater_ids = ['ac4119', 'gm2858', 'yj2369','yp2201']
num_raters = len(rater_ids)

Defining the starting index, and the number of examples to include per batch

In [None]:
start_id = 0
batch_size = 140

In [34]:
supplmental_batches = []

for i in range(num_raters):
    end_id = start_id + batch_size
    supplmental_batches.append((start_id, end_id))
    start_id = end_id

In [35]:
print(supplmental_batches)

[(0, 140), (140, 280), (280, 420), (420, 560)]
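
The loop above produces contiguous, non-overlapping `(start, end)` slices. The same ranges can be written as a single expression; the helper name `make_batch_ranges` below is hypothetical and only illustrates the arithmetic.

```python
def make_batch_ranges(start, batch_size, num_batches):
    """Return contiguous half-open (start, end) index ranges, one per batch."""
    return [
        (start + k * batch_size, start + (k + 1) * batch_size)
        for k in range(num_batches)
    ]

ranges = make_batch_ranges(start=0, batch_size=140, num_batches=4)
print(ranges)  # [(0, 140), (140, 280), (280, 420), (420, 560)]
```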


Randomizing the order of raters and assigning batches based on order.

In [38]:
random.shuffle(rater_ids)

In [39]:
assign_batches_to_raters =  list(zip(rater_ids,supplmental_batches))
print(assign_batches_to_raters)

[('ac4119', (0, 140)), ('yj2369', (140, 280)), ('yp2201', (280, 420)), ('gm2858', (420, 560))]
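
Note that `random.shuffle` reorders the list in place and gives a different result on each run, so re-executing the cells above would change the assignment. If a reproducible assignment is wanted, one option is a seeded generator. This is a sketch only: the `assign_batches` helper and the seed value are assumptions, not something the original notebook uses.

```python
import random

def assign_batches(rater_ids, batches, seed):
    """Pair raters with batches after a seeded shuffle, leaving the input list untouched."""
    rng = random.Random(seed)    # private generator, does not affect the global random state
    shuffled = list(rater_ids)   # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    return list(zip(shuffled, batches))

raters = ['ac4119', 'gm2858', 'yj2369', 'yp2201']
batches = [(0, 140), (140, 280), (280, 420), (420, 560)]

first = assign_batches(raters, batches, seed=2022)   # hypothetical seed
second = assign_batches(raters, batches, seed=2022)
print(first == second)  # True: the same seed gives the same assignment
```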


Saving to pickle the remaining example_ids that require backfilling for the next batch 

In [42]:
remaining_examples_to_backfill = missing_example_ids[560:]

In [43]:
len(remaining_examples_to_backfill)

37

In [47]:
remaining_backfill_path = f'{repo_dir}/data/backfill_example_ids.p'
# pickle.dump(remaining_examples_to_backfill, open(remaining_backfill_path, "wb" ) )
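
When the dump is eventually run, wrapping the file in a `with` block ensures the handle is closed and the data flushed. The round trip below is a minimal sketch that writes to a temporary directory rather than the repository; `save_ids`, `load_ids`, and the demo file name are illustrative.

```python
import os
import pickle
import tempfile

def save_ids(ids, path):
    """Pickle a list of example IDs to `path`."""
    with open(path, "wb") as f:
        pickle.dump(ids, f)

def load_ids(path):
    """Load a pickled list of example IDs from `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip in a throwaway directory so no real data file is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "backfill_example_ids_demo.p")
    save_ids([101, 202, 303], demo_path)
    restored = load_ids(demo_path)

print(restored)  # [101, 202, 303]
```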

### Reading in pre-collected dataset

In [40]:
daily_discussion_data_fname = "daily_master_data_1614250838_1618692612.csv"
daily_discussion_data_path = f"{repo_dir}/data/{daily_discussion_data_fname}"
daily_discussion_data = pd.read_csv(daily_discussion_data_path).rename(columns={'Unnamed: 0':'example_id'})

### Generating files for labelling based on batch assignment

This chunk takes each `rater_id` and its assigned index range `(start_ind, end_ind)` as inputs. The rater's slice of the missing example_ids is combined with that rater's entries in `interrater_assignment`. For every assigned example, the process collects the `comment_for_evaluation` together with its context (i.e. `preceding_comment`, `following_comment`) and adds an empty column for each toxicity attribute label, producing a file that is more conducive to labelling and to text classification training/scoring.

In [55]:
to_label_dir = f"{repo_dir}/data/to_label"
current_date = dt.date.today()
labels = ['toxicity', 'severe_toxicity', 'identity_attack', 'insult', 'profanity', 'threat']

for rater_id, (start_ind, end_ind) in assign_batches_to_raters:
    fname = f"{rater_id}_labelling_assignment_{current_date}"
    export_location = f"{to_label_dir}/{fname}"
    assigned_indices = missing_example_ids[start_ind:end_ind]
    assigned_indices += interrater_assignment[rater_id]
    required_examples = []
    for ind in assigned_indices:
        # One row each for the preceding comment, the example itself, and the following comment
        ind_examples = [
            {
                'example_id': str(ind),
                'example_type': 'preceding',
                'body': daily_discussion_data[daily_discussion_data['example_id'] == ind-1]['body'].values[0]
            },
            {
                'example_id': str(ind),
                'example_type': 'example',
                'body': daily_discussion_data[daily_discussion_data['example_id'] == ind]['body'].values[0]
            },
            {
                'example_id': str(ind),
                'example_type': 'following',
                'body': daily_discussion_data[daily_discussion_data['example_id'] == ind+1]['body'].values[0]
            },
        ]
        required_examples += ind_examples
    assigned_data = pd.DataFrame(required_examples)
    # Adding in the primary example info and the type of example for each row.
    assigned_data = assigned_data[['example_type','example_id','body']].reset_index()
    assigned_examples_pivot = assigned_data.pivot(index='example_id', columns='example_type', values='body').reset_index()[['example_id','preceding', 'example','following']]
    assigned_examples = assigned_examples_pivot.rename(columns={
         'preceding':'preceding_comment',
         'following':'following_comment',
         'example':'comment_for_evaluation'})
    for label in labels:
         assigned_examples[label] = ""
    assigned_examples.to_csv(export_location, index=False)
    

In [56]:
len(assigned_examples)

260
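
The export loop filters the full dataframe three times per example to find the preceding, current, and following comments. A lighter-weight variant indexes the comment bodies by `example_id` once and looks neighbours up by label. The sketch below runs on toy data and rests on two assumptions: `example_id` values are unique, and neighbouring comments have consecutive IDs. Unlike the loop above, it fills a missing neighbour with an empty string instead of raising an error; `build_context_rows` is a hypothetical helper name.

```python
import pandas as pd

def build_context_rows(comments, example_ids):
    """Build one row per example with its preceding and following comment bodies."""
    # Index bodies by example_id once so each lookup is a label access
    body_by_id = comments.set_index("example_id")["body"]
    rows = []
    for ind in example_ids:
        rows.append({
            "example_id": str(ind),
            # .get returns the default when the neighbouring ID does not exist
            "preceding_comment": body_by_id.get(ind - 1, ""),
            "comment_for_evaluation": body_by_id[ind],
            "following_comment": body_by_id.get(ind + 1, ""),
        })
    return pd.DataFrame(rows)

# Toy data standing in for daily_discussion_data
toy_comments = pd.DataFrame({
    "example_id": [0, 1, 2, 3],
    "body": ["a", "b", "c", "d"],
})
context = build_context_rows(toy_comments, [1, 3])
print(context)
```

Because the rows are built in wide form directly, no pivot step is needed before adding the label columns.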