### Interrater Reliability Assessment Example Selection

This notebook helps us randomly select examples to use for our initial interrater reliability assessments.
Because our dataset is mostly non-toxic, we selected examples in a semi-random fashion. 

We use `.sample()` from `pandas` to randomly surface 10 examples at a time. If we believe a comment could fall into one or more of our labelling categories, we note its index as a viable example.

#### Required Libraries

In [5]:
import pandas as pd

In [6]:
pd.set_option('display.max_colwidth', None) # Show the entire comment instead of truncating it

#### Importing Dataset

In [7]:
# Replace this location with where the repo is.
repository_location = "/Users/ameliachu/repos/nlu-reddit-toxicity-dataset"

In [8]:
full_dataset_comments_path = f"{repository_location}/data/gme_master_data_1611233441_1618627983.csv"

In [9]:
full_dataset = pd.read_csv(full_dataset_comments_path)

#### Looking at Examples Randomly Sampled 10 at a Time

In [8]:
random_sample_10 = full_dataset[['body']].sample(10)
random_sample_10

Unnamed: 0,body
105744,GIVE IT UP FOR THE POWER HALF HOUR
102746,or each other’s nuts. no homo
8819,If you bought chicken Tendies for 5 dollars then some came up and offered you 2 dollars for them would you sell?
160776,I smell that smell. That smelly smell of something smelly is happening.
59953,Nice
29754,He doesn’t own any shares man
2484,Thought I was going to double my GME shares today. I was wrong.\n\n&#x200B;\n\nI tripled them. Lets fucking go.
59269,...that’s not a thing
22515,If you bought GME at over 300...lol
69724,Posture check
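
Because `.sample()` draws a different set of rows on every run, a batch cannot be revisited later unless a seed is fixed. The sketch below uses a small made-up frame in place of `full_dataset` and shows one way to make each batch reproducible through the `random_state` parameter. The helper name `sample_batch` and the choice of the batch number as the seed are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for full_dataset; the notebook itself reads the comments CSV.
toy_dataset = pd.DataFrame({"body": [f"comment {i}" for i in range(100)]})


def sample_batch(df, batch_number, batch_size=10):
    """Return a reproducible random batch of comment bodies.

    The batch number doubles as the seed (an illustrative choice), so
    batch 0, batch 1, ... can each be regenerated exactly.
    """
    return df[["body"]].sample(n=batch_size, random_state=batch_number)


first_batch = sample_batch(toy_dataset, batch_number=0)
same_batch_again = sample_batch(toy_dataset, batch_number=0)
```

Calling `sample_batch` with the same `batch_number` returns the same rows, which makes it possible to refer back to a batch when raters disagree about an example.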


In [68]:
# Exploring examples by keyword (change `keyword` and re-run to explore other terms)
keyword = 'jerk'
full_dataset[full_dataset['body'].str.contains(keyword, na=False)][['body']].sample(10)

Unnamed: 0,body
101291,"hey guy with 10,000 share sell order at $26.65: fuck you"
239690,Yep day trader SEC would come fuck your sister and take your money.
217190,Has anyone ever exercised call on robinhood? Anyway to avoid getting fucked?... seems like good time to exercise ...? Or not worth in on robinhood ? Any opinions about this would be appreciated :)\n\nEdit to say... fuck you to whoever downvote. I didn’t get fucked with gme last time. I been here to support and that’s it.
12102,Show me how the fuck you close all those 100-120 short ladders now you hf assholes!\n\nHalted again!
135906,"Robbing my BB gains to buy more. It ain't much but, fuck you Shitron."
58992,Nope. \n\nAnd fuck you too
54305,Your account is brand new and you only have anti GME comments in these threads. Go fuck yourself.
134168,Holy fuck you're right. The wave sizes swapped sides
11707,Selling because your wife's boyfriend came over to fuck you is a bad idea. If you can hold his cock hold the stock pussy
279899,The play is to fuck your wife while you masturbate and cry in the corner Melvin
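
By default, `str.contains` treats the keyword as a regular expression and matches case-sensitively, and `.sample(10)` raises a `ValueError` when fewer than 10 rows match. A minimal sketch of a more forgiving lookup follows, using toy data and a hypothetical helper named `sample_by_keyword` (neither is part of the original notebook).

```python
import pandas as pd

# Toy stand-in for full_dataset, including a missing body and regex metacharacters.
toy_dataset = pd.DataFrame(
    {"body": ["What a JERK", "jerk move", "Nice", None, "cost is $5 (approx.)"]}
)


def sample_by_keyword(df, keyword, n=10):
    """Sample up to n comments whose body contains keyword (literal, case-insensitive)."""
    mask = df["body"].str.contains(keyword, case=False, regex=False, na=False)
    matches = df.loc[mask, ["body"]]
    # Cap the sample size so that rare keywords do not raise a ValueError.
    return matches.sample(n=min(n, len(matches)))


jerk_matches = sample_by_keyword(toy_dataset, "jerk")
literal_matches = sample_by_keyword(toy_dataset, "(approx.)")
```

Passing `regex=False` means characters such as `(`, `.` or `$` in a keyword are matched literally rather than interpreted as a pattern.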


#### Samples identified as representative (potentially falling under one or more categories)

In [38]:
viable_examples = [22273, 3, 115811, 248996, 187366, 201032, 288906,
                   288908, 6509, 220941, 71214, 157648, 62874, 5499, 
                   49308, 258175, 60073, 27500, 2147, 152987]

selected_examples = [115811, 6509, 3, 49308, 201032, 157648, 71214, 62874, 5499] # 136623,23120,288908,

In [39]:
set(viable_examples) - set(selected_examples)

{22273, 27500, 60073, 187366, 220941, 248996, 258175, 288906, 288908}

In [11]:
# full_dataset[['created_utc', 'sub_id','body']].iloc[selected_examples]

#### Obtaining the Preceding and Following Comment

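We include the preceding and following comment as context for our labelling evaluation. We do this by sorting the dataset by time (`created_utc`) and taking the rows immediately before and after each example. Because the sort below does not group by submission (`sub_id`), a neighbouring comment may occasionally come from a different submission.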

In [26]:
# Sort by time; reset_index(drop=False) keeps the original row index in an 'index' column.
sorted_full_dataset = full_dataset[['created_utc', 'sub_id','body']]\
.sort_values(by=['created_utc'])\
.reset_index(drop=False)
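
The sort above orders comments by time only. If neighbours should be restricted to the same submission, one option is to sort by `sub_id` and then `created_utc`, and take the previous and next comment within each submission. The following is a sketch on toy data, not the approach used to produce the outputs in this notebook; the frame `toy` and the column names `preceding` and `following` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "created_utc": [3, 1, 2, 5, 4],
    "sub_id": ["a", "a", "b", "a", "b"],
    "body": ["a-second", "a-first", "b-first", "a-third", "b-second"],
})

# Sort by submission first, then by time within each submission.
# drop=False keeps the original row index in an 'index' column.
toy_sorted = toy.sort_values(by=["sub_id", "created_utc"]).reset_index(drop=False)

# shift() within each submission so context never crosses thread boundaries;
# the first/last comment of a thread gets a missing value instead of a neighbour.
toy_sorted["preceding"] = toy_sorted.groupby("sub_id")["body"].shift(1)
toy_sorted["following"] = toy_sorted.groupby("sub_id")["body"].shift(-1)
```

With this layout, the context for an example is read directly from its own row, and a missing value signals that the comment opened or closed its thread.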

In [27]:
sorted_full_dataset.head(10)

Unnamed: 0,index,created_utc,sub_id,body
0,103083,1611233000.0,l1xtan,💎🙌🏼 in 🇩🇪
1,105008,1611233000.0,l1xtan,F**ck Andrew Left and Citron for their market manipulation.
2,113385,1611233000.0,l1xtan,Hi 💎🤲
3,117004,1611233000.0,l1xtan,Here we go.
4,104623,1611233000.0,l1xtan,Do not fuckin sell bitches!
5,104796,1611233000.0,l1xtan,Is today our battle of the bastards? Hopefully it doesn't end as badly as that series....
6,119036,1611233000.0,l1xtan,I am here
7,113386,1611233000.0,l1xtan,LETS GOOOO
8,114987,1611233000.0,l1xtan,Mods keep removing my [retarded GME meme.](https://www.reddit.com/r/wallstreetbets/comments/l1xjae/gme_gang_vs_melvin_capital_ft_udeepfuckingvalue/) Enjoy the autism.
9,109611,1611234000.0,l1xtan,So we’re all just going to reload after the post-call dip? Yeah?!? Good. I’m in. GME TDM 🚀🚀🚀🚀


For each selected example, we use the new indices (row positions in the time-sorted dataset) to label the preceding and following comments. The resulting map is then used to create a pivot table with all the selected examples and related info.

In [69]:
selected_examples = [187366, 288906, 60073, 54305, 217190, 288908]

In [70]:
index_map = {}

for ind in selected_examples:
    # Row position of the example in the sorted frame; the original index lives in the 'index' column.
    new_ind = int(sorted_full_dataset[sorted_full_dataset['index'] == ind].index[0])

    # Tag the example and its immediate neighbours with the example's original id.
    for offset, example_type in [(-1, 'preceding'), (0, 'example'), (1, 'following')]:
        index_map[new_ind + offset] = {
            'example_id': str(ind),
            'type': example_type
        }

In [71]:
row_indices_needed = list(index_map.keys())
# sorted_full_dataset.iloc[row_indices_needed]

In [72]:
# Placing the sorted row position (the current index) into a column
sorted_full_dataset['new_index'] = sorted_full_dataset.index

In [73]:
# Adding in the primary example info and the type of example for each row.
sorted_full_dataset['example_type'] = sorted_full_dataset.apply(lambda x: index_map.get(x['new_index'], {}).get("type"), axis=1)
sorted_full_dataset['example_id'] = sorted_full_dataset.apply(lambda x: index_map.get(x['new_index'], {}).get("example_id"), axis=1)

In [74]:
# Narrowing down dataframe to the selected examples and columns needed.
selected_examples_df = sorted_full_dataset[sorted_full_dataset['new_index'].isin(row_indices_needed)][['example_type','example_id','body']]

In [75]:
selected_examples_df

Unnamed: 0,example_type,example_id,body
65069,preceding,54305,All apes on deck. Sale price!!!
65070,example,54305,Your account is brand new and you only have anti GME comments in these threads. Go fuck yourself.
65071,following,54305,"Like it or not, this place is a part of history now, a glowing ember of a long-enduring disdain for the mega-rich. I do really wish this subreddit the best as it transitions back to business and loss porn as usual. And I also wish that people realize that we have won so we can move on to other means of, well, sticking it to the man."
88114,preceding,60073,Unable to buy GME on either Revolut or Sharesies due to hem sharing the same US partner 'DriveWealth'. Why do we bother saying 'I like the stock' while big companies clearly work together to fuck us over.
88115,example,60073,Fuck off cunt. Take your GME shares and shove them up your ass.
88116,following,60073,Can’t lose anything if you don’t sell 🚀💎 .HOLD.
129401,preceding,288906,I'll have that with a strawberry shake and large fries
129402,example,288906,"listen retard, dont you get it, I literally meant that only PLTR was literally the only green stock on both the NYSE and NASDQ, and that I was wrong, and that thank christ the retard above caught me. \n\n\nFortunately now he can save us from making bad financial decisions."
129403,following,288906,Gathered 5 more shares since Monday. Slow and steady
139913,preceding,288908,"https://twitter.com/ryancohen/status/1367555194196787206\n\n[""I like your shorts, buddy""](https://www.youtube.com/watch?v=gxH3UgCW9Cg)"


In [78]:
# Pivoting the required data so that it is easier for raters to assess.
selected_examples_pivot_table = selected_examples_df.pivot(index='example_id', columns='example_type', values='body').reset_index()[['example_id','preceding', 'example','following']]

In [79]:
selected_examples_pivot_table 

example_type,example_id,preceding,example,following
0,187366,"the best time to buy was at opening, the second best time to buy is right now.\n\nnaturally this is a casino so bet what you are willing to lose",WORDS CANNOT DESCRIBE HOW SPECTACULARLY ERECT I AM,10k I'd be able to buy a house and set up my retirement. I'd have to work for another 20 or so years but I can live with that.
1,217190,You need to stop trading after your ambien dude.,Has anyone ever exercised call on robinhood? Anyway to avoid getting fucked?... seems like good time to exercise ...? Or not worth in on robinhood ? Any opinions about this would be appreciated :)\n\nEdit to say... fuck you to whoever downvote. I didn’t get fucked with gme last time. I been here to support and that’s it.,One way or another there are going to be some massive bagholders from today: Either the hedge funds or the retards. Great viewing either way.
2,288906,I'll have that with a strawberry shake and large fries,"listen retard, dont you get it, I literally meant that only PLTR was literally the only green stock on both the NYSE and NASDQ, and that I was wrong, and that thank christ the retard above caught me. \n\n\nFortunately now he can save us from making bad financial decisions.",Gathered 5 more shares since Monday. Slow and steady
3,288908,"https://twitter.com/ryancohen/status/1367555194196787206\n\n[""I like your shorts, buddy""](https://www.youtube.com/watch?v=gxH3UgCW9Cg)","Oh shit, yea that sucks. Hope your calls aren’t expiring soon.",I'm getting hard
4,54305,All apes on deck. Sale price!!!,Your account is brand new and you only have anti GME comments in these threads. Go fuck yourself.,"Like it or not, this place is a part of history now, a glowing ember of a long-enduring disdain for the mega-rich. I do really wish this subreddit the best as it transitions back to business and loss porn as usual. And I also wish that people realize that we have won so we can move on to other means of, well, sticking it to the man."
5,60073,Unable to buy GME on either Revolut or Sharesies due to hem sharing the same US partner 'DriveWealth'. Why do we bother saying 'I like the stock' while big companies clearly work together to fuck us over.,Fuck off cunt. Take your GME shares and shove them up your ass.,Can’t lose anything if you don’t sell 🚀💎 .HOLD.
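
As a small illustration of the reshaping step, the sketch below applies the same `pivot` call to a toy long-format frame (the values are made up). Each `example_id` becomes one row, and the three `example_type` values become columns.

```python
import pandas as pd

# Toy long-format frame with one row per (example_id, example_type) pair.
toy_long = pd.DataFrame({
    "example_type": ["preceding", "example", "following"] * 2,
    "example_id": ["7", "7", "7", "42", "42", "42"],
    "body": ["before 7", "comment 7", "after 7",
             "before 42", "comment 42", "after 42"],
})

# Reshape to one row per example, then fix the column order for raters.
toy_wide = (
    toy_long
    .pivot(index="example_id", columns="example_type", values="body")
    .reset_index()[["example_id", "preceding", "example", "following"]]
)
```

Because `example_id` is stored as a string, rows are ordered lexicographically, which is why `54305` appears after `288908` in the table above. `pivot` also raises a `ValueError` if the same (`example_id`, `example_type`) pair occurs more than once.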


In [80]:
# Saving the assessment to the data section
selected_examples_pivot_table.to_csv(f"{repository_location}/data/new_interater.csv", index=False)