## Selecting Initial Batch of 200 Examples to Label

In this notebook, we generate a randomized list of `example_id`s from the pre-collected dataset (`daily_master_data_1614250838_1618692612.csv`). This randomized list will be used to assign raters batches of data to label.

In [1]:
import pandas as pd
import random
import pickle
import datetime as dt

In [2]:
repo_dir = "/Users/ameliachu/repos/nlu-reddit-toxicity-dataset"

In [3]:
# Defining output directory/paths
randomized_example_ids_path = f"{repo_dir}/data/randomized_example_ids.p"
to_label_dir = f"{repo_dir}/data/to_label"

In [4]:
daily_discussion_data_fname = "daily_master_data_1614250838_1618692612.csv"
daily_discussion_data_path = f"{repo_dir}/data/{daily_discussion_data_fname}"

### Reading in pre-collected dataset

In [6]:
daily_discussion_data = pd.read_csv(daily_discussion_data_path).rename(columns={'Unnamed: 0':'example_id'})

In [5]:
daily_discussion_data.head(5)

,example_id,sub_id,created_utc,body,score,author
0,0,ls42x6,1614251000.0,first,7,I_make_switch_a_roos
1,1,ls42x6,1614251000.0,Rise and shine bitches,41,LitenVarg
2,2,ls42x6,1614251000.0,Here we go. 🚀,14,readingtostrangers
3,3,ls42x6,1614251000.0,GME to 420.69 EOD,14,wottsraja
4,4,ls42x6,1614251000.0,Second retard,2,AceSouth


In [9]:
num_examples = len(daily_discussion_data)
num_dst_example_ids = daily_discussion_data['example_id'].nunique()
print(f"Number of Examples: {num_examples:,}; Number of Example Ids: {num_dst_example_ids:,}")

Number of Examples: 619,646; Number of Example Ids: 619,646


Collecting the `example_id`s and randomizing order

In [11]:
example_indices = list(daily_discussion_data['example_id'].values)
print(example_indices[:5])

[0, 1, 2, 3, 4]


In [9]:
random.seed(519)
random.shuffle(example_indices)
print(example_indices[:5])

[494030, 420324, 473177, 419306, 506755]
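Seeding before the shuffle is what makes the randomized order reproducible across runs. A minimal sketch of the same idea, assuming a hypothetical helper `shuffled_ids` (not part of this notebook) that uses a private `random.Random` instance so the global random state is left untouched:

```python
import random

def shuffled_ids(ids, seed=519):
    """Return a shuffled copy of ids using a private, seeded generator.

    Hypothetical helper for illustration; the notebook itself calls
    random.seed(519) followed by random.shuffle on the list in place.
    """
    rng = random.Random(seed)  # local RNG, independent of the global one
    out = list(ids)            # copy so the caller's list keeps its order
    rng.shuffle(out)
    return out

first = shuffled_ids(range(10))
second = shuffled_ids(range(10))
print(first == second)  # the same seed gives the same order
```

Using a local generator is a design choice rather than a requirement: it keeps later calls such as shuffling the raters from depending on how many random draws happened earlier in the session.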


In [10]:
# Using a context manager so the file handle is closed after writing
with open(randomized_example_ids_path, "wb") as f:
    pickle.dump(example_indices, f)

### Randomly assigning the initial batch (n=200)

Reading in the randomized `example_id` list

In [11]:
# Using a context manager so the file handle is closed after reading
with open(randomized_example_ids_path, "rb") as f:
    example_indices = pickle.load(f)

Defining the raters who need to be assigned a batch.

In [14]:
rater_ids = ['ac4119', 'gm2858', 'yj2369','yp2201']
num_raters = len(rater_ids)

Defining the starting index and the number of examples to include per batch.

In [15]:
start_id = 0 
batch_size = 200

In [17]:
init_batches = []

for i in range(num_raters):
    end_id = start_id + batch_size
    init_batches.append((start_id,end_id))
    start_id = end_id

print(init_batches)

[(0, 200), (200, 400), (400, 600), (600, 800)]
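The loop above produces consecutive, non-overlapping half-open ranges into the randomized list. As a sketch, the same boundaries can be computed by a small function; `batch_ranges` and its parameter names are hypothetical and not part of the notebook:

```python
def batch_ranges(num_batches, batch_size, start=0):
    """Return consecutive (start, end) index pairs, one per batch.

    Each pair is half-open, matching Python slicing: items[start:end].
    """
    return [
        (start + i * batch_size, start + (i + 1) * batch_size)
        for i in range(num_batches)
    ]

ranges = batch_ranges(num_batches=4, batch_size=200)
print(ranges)  # [(0, 200), (200, 400), (400, 600), (600, 800)]
```

Passing a non-zero `start` would let a later batch assignment continue from where the previous one ended, assuming the same randomized list is reused.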


Randomizing the order of raters and assigning batches based on order.

In [18]:
random.shuffle(rater_ids)

In [21]:
assign_batches_to_raters = list(zip(rater_ids,init_batches))

In [22]:
assign_batches_to_raters

[('gm2858', (0, 200)),
 ('yp2201', (200, 400)),
 ('yj2369', (400, 600)),
 ('ac4119', (600, 800))]
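The pairing step relies on there being exactly one batch per rater, since `zip` silently drops unmatched items. A hedged sketch that makes this check explicit; `assign_batches` and its `seed` parameter are hypothetical names introduced for illustration:

```python
import random

def assign_batches(rater_ids, batches, seed=None):
    """Pair each rater with one batch after shuffling the rater order.

    Hypothetical helper: raises if the counts differ instead of letting
    zip() truncate to the shorter input.
    """
    if len(rater_ids) != len(batches):
        raise ValueError("need exactly one batch per rater")
    rng = random.Random(seed)  # local RNG; seed=None gives a fresh order
    order = list(rater_ids)    # copy so the input list is not reordered
    rng.shuffle(order)
    return list(zip(order, batches))

demo_raters = ['ac4119', 'gm2858', 'yj2369', 'yp2201']
demo_batches = [(0, 200), (200, 400), (400, 600), (600, 800)]
demo_assignment = assign_batches(demo_raters, demo_batches, seed=0)
for rater, (start, end) in demo_assignment:
    print(rater, start, end)
```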

### Generating files for labelling based on batch assignment

This chunk takes each `rater_id` and its assigned indices `(start_ind, end_ind)` as inputs. It collects the relevant data and generates a file that is better suited to labelling and to text classification training/scoring. Specifically, the code below collects the context (i.e. `preceding_comment`, `following_comment`) and the `comment_for_evaluation`, and adds a column for each toxic attribute label.

*Note*: an issue was detected with this chunk after assigning the initial batch. This methodology does not allow the same comment to appear in more than one role in the same file (e.g. a comment that is both a `preceding_comment` and a `comment_for_evaluation`). This was rectified in subsequent batch assignments.
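To illustrate the limitation in the note: when one comment is needed in two roles, a mapping keyed by row index can only remember one of them. The sketch below is one possible way to avoid that, not necessarily how later batches were fixed. It looks up each neighbour directly, assuming `example_id` equals the row position and using a hypothetical helper `build_labelling_rows` that works on a plain list of comment bodies:

```python
def build_labelling_rows(bodies, assigned_ids):
    """Build one row per assigned comment with its neighbouring comments.

    Hypothetical helper: neighbours are looked up per example, so a comment
    can serve as context for one row and as the evaluated comment in another.
    Missing neighbours at either end of the data are stored as None.
    """
    rows = []
    last = len(bodies) - 1
    for ind in assigned_ids:
        rows.append({
            "example_id": ind,
            "preceding_comment": bodies[ind - 1] if ind > 0 else None,
            "comment_for_evaluation": bodies[ind],
            "following_comment": bodies[ind + 1] if ind < last else None,
        })
    return rows

demo_bodies = ["a", "b", "c", "d"]
demo_rows = build_labelling_rows(demo_bodies, [1, 2, 0])
for row in demo_rows:
    print(row)
```

In this toy input, comment `"b"` is the evaluated comment for `example_id` 1 and the preceding comment for `example_id` 2, and both rows keep it. The resulting list of dicts can be passed to `pd.DataFrame(...)` before adding the label columns.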

In [27]:
current_date = dt.date.today()
labels = ['toxicity', 'severe_toxicity', 'identity_attack', 'insult', 'profanity', 'threat']

for rater_id, (start_ind, end_ind) in assign_batches_to_raters:
    # init export file names
    fname = f"{rater_id}_labelling_assignment_{current_date}"
    export_location = f"{to_label_dir}/{fname}"
    # determining which example_ids are the comment_for_evaluation
    assigned_indices = example_indices[start_ind:end_ind]
    index_map = {}
    required_indices = []
    # Collecting the context for each comment_for_evaluation
    for ind in assigned_indices:
        index_map[ind-1] = {
            'example_id': str(ind),
            'type': 'preceding'
        }
        index_map[ind] = {
            'example_id': str(ind),
            'type': 'example'
        }
        index_map[ind+1] = {
            'example_id': str(ind),
            'type': 'following'
        }
        required_indices += [ind-1, ind, ind+1]
    # collecting all the example_ids needed for this batch 
    # (i.e. context & comment_for_evaluation)
    assigned_data = daily_discussion_data.iloc[required_indices]
    
    # Adding in the primary example info and the type of example for each row.
    assigned_data ['example_type'] = assigned_data .apply(lambda x: index_map.get(x['example_id'], {}).get("type"), axis=1)
    assigned_data ['example_id'] = assigned_data .apply(lambda x: index_map.get(x['example_id'], {}).get("example_id"), axis=1)
    assigned_data = assigned_data[['example_type','example_id','body']]
    
    # Pivoting the dataframe so that each example_type is its own column
    assigned_examples_pivot = assigned_data.pivot(index='example_id', columns='example_type', values='body').reset_index()[['example_id','preceding', 'example','following']]
    assigned_examples = assigned_examples_pivot.rename(columns={
         'preceding':'preceding_comment',
         'following':'following_comment',
         'example':'comment_for_evaluation'})
    
    # Adding in columns for each toxic attribute label
    for label in labels:
         assigned_examples[label] = ""
    # Write out file
    assigned_examples.to_csv(export_location, index=False)
    

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  assigned_data ['example_type'] = assigned_data .apply(lambda x: index_map.get(x['example_id'], {}).get("type"), axis=1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  assigned_data ['example_id'] = assigned_data .apply(lambda x: index_map.get(x['example_id'], {}).get("example_id"), axis=1)
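The warnings above come from assigning new columns to a frame that pandas tracks as derived from another frame. One common way to avoid them, shown here as a minimal sketch on toy data rather than on the notebook's dataset, is to take an explicit `.copy()` of the selection before adding columns:

```python
import pandas as pd

# Toy stand-in for the dataset; values are illustrative only
demo_frame = pd.DataFrame({
    "example_id": [0, 1, 2, 3],
    "body": ["a", "b", "c", "d"],
})

# Explicit copy: later assignments clearly target the new frame
demo_subset = demo_frame.iloc[[1, 2]].copy()
demo_subset["example_type"] = ["preceding", "example"]

print(demo_subset)
```

The original frame is left unchanged, and the intent (modify the subset only) is explicit.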

#### Verifying Output

In [24]:
assigned_data_example = pd.read_csv(export_location)

In [25]:
assigned_data_example.head(5)

,example_id,preceding_comment,comment_for_evaluation,following_comment,toxicity,severe_toxicity,identity_attack,insult,profanity,threat
0,105580,I'm shorting JPOW,It's going back up,I got into so many good spacs today 🦘,,,,,,
1,106714,Were going to need to change the banner at the...,I wish all the fucking leaf blowers would come...,Waiting to load up on NASDAQ CFDs at around 12...,,,,,,
2,107332,"Lol, this is still correction territory.\n\nWa...",I fucking hope not,JAY FIRE THE PRINTERS! JAY?! JAYYYYYYYY,,,,,,
3,11714,I did 2 years ago all cards went like times 10,All these fucks have paper hands,AMC!!!!!!,,,,,,
4,117374,Yeah wtf do you think “I was close” means?,I like sugar with my margaritas. Not salt.,I like alts on top of alts. And boy did I get it,,,,,,
