## Setting Up

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

import pandas as pd


In [19]:
# Show up to 1000 characters of each comment when displaying DataFrames.
pd.set_option('display.max_colwidth', 1000)

In [2]:
DATA_DIR = '../data/'
SEED = 12

## Clean and Prep Wiki Data

In [3]:
import pandas as pd

In [4]:
toxicity_annotated_comments = pd.read_csv(os.path.join(DATA_DIR, 'toxicity_annotated_comments.tsv'), sep = '\t')
toxicity_annotations = pd.read_csv(os.path.join(DATA_DIR, 'toxicity_annotations.tsv'), sep = '\t')

In [5]:
annotations_gped = toxicity_annotations.groupby('rev_id', as_index=False).agg({'toxicity': 'mean'})
all_data = pd.merge(annotations_gped, toxicity_annotated_comments, on = 'rev_id')
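
The aggregation and merge above can be sketched on toy data. This is a minimal illustration, assuming `toxicity` holds one 0/1 vote per annotator; the `toy_*` frames and their values are made up and are not taken from the TSV files.

```python
import pandas as pd

# Toy stand-ins for the two TSV files; the values are made up for illustration.
toy_annotations = pd.DataFrame({
    'rev_id': [1, 1, 2, 2, 2],
    'toxicity': [0, 1, 1, 1, 0],
})
toy_comments = pd.DataFrame({
    'rev_id': [1, 2],
    'comment': ['first comment', 'second comment'],
})

# One row per rev_id, holding the mean of that comment's annotator votes.
toy_gped = toy_annotations.groupby('rev_id', as_index=False).agg({'toxicity': 'mean'})

# Inner join (the pd.merge default) attaches each mean score to its comment.
toy_all = pd.merge(toy_gped, toy_comments, on='rev_id')
print(toy_all)
```

Because `pd.merge` defaults to an inner join, a `rev_id` present in only one of the two frames would be dropped from the result.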

In [6]:
all_data['comment'] = all_data['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
all_data['comment'] = all_data['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))

# TODO(nthain): Consider doing regression instead of classification
all_data['is_toxic'] = all_data['toxicity'] > 0.5
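
A small sketch of the cleaning and labelling step on made-up rows. It uses the vectorized `.str.replace` as an alternative to `apply` with a lambda; the `toy` frame is hypothetical. Note that the strict `>` means a mean score of exactly 0.5 is labelled non-toxic.

```python
import pandas as pd

# Made-up rows covering both placeholder tokens and the 0.5 boundary case.
toy = pd.DataFrame({
    'comment': ['aNEWLINE_TOKENb', 'cTAB_TOKENd', 'e'],
    'toxicity': [0.2, 0.5, 0.7],
})

# regex=False treats the token as a literal string rather than a pattern.
toy['comment'] = (toy['comment']
                  .str.replace('NEWLINE_TOKEN', ' ', regex=False)
                  .str.replace('TAB_TOKEN', ' ', regex=False))

# Strict inequality: only scores above 0.5 count as toxic.
toy['is_toxic'] = toy['toxicity'] > 0.5
print(toy)
```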

In [7]:
# split into train, test, dev
wiki_splits = {}
for split in ['train', 'test', 'dev']:
    wiki_splits[split] = all_data.query('split == @split')
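
The `query` call above relies on two different meanings of `split`: the bare name is the DataFrame column, while `@split` is the Python loop variable. A minimal sketch on a made-up frame (`toy_data` is hypothetical):

```python
import pandas as pd

toy_data = pd.DataFrame({
    'comment': ['a', 'b', 'c', 'd'],
    'split': ['train', 'dev', 'train', 'test'],
})

toy_splits = {}
for split in ['train', 'test', 'dev']:
    # '@split' refers to the loop variable; bare 'split' is the column.
    toy_splits[split] = toy_data.query('split == @split')

# Number of rows that landed in each split.
sizes = {name: len(df) for name, df in toy_splits.items()}
print(sizes)
```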

In [8]:
#for split in wiki_splits:
#    wiki_splits[split].to_csv(os.path.join(DATA_DIR, 'wiki_%s.csv' % split), index=False)

### Prep debiasing data

In [9]:
def augment_with_data(source_df, target_path, target_name, sep = '\t', write = False):
    # source_df is a dict mapping split name -> DataFrame (e.g. wiki_splits).
    target_df = pd.read_csv(target_path, sep = sep)
    # Tag the added rows so they can be told apart from the wiki data.
    target_df['sample'] = target_name
    target_splits = {}
    for split in source_df:
        # Append the matching split of the extra data, then shuffle reproducibly.
        target_splits[split] = pd.concat([source_df[split],
                                          target_df.query('split == @split')]).sample(frac = 1, random_state = SEED)
        if write:
            target_splits[split].to_csv(os.path.join(DATA_DIR, 'wiki_%s_%s.csv' % (target_name, split)), index=False)
    return target_splits

In [10]:
debias_splits = augment_with_data(wiki_splits, os.path.join(DATA_DIR, 'toxicity_debiasing_data.tsv'), 'debias')
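
To see what the augmentation does without touching the files, here is a sketch with a hypothetical file-free variant, `augment_in_memory`, which takes the extra data as a DataFrame instead of a path. The toy frames and their values are made up; only the concat, tag, and shuffle logic mirrors `augment_with_data`.

```python
import pandas as pd

SEED = 12  # same seed value as the notebook

def augment_in_memory(source_splits, target_df, target_name):
    """Hypothetical file-free variant of augment_with_data, for illustration only."""
    target_df = target_df.copy()
    # Tag the added rows with the name of the extra dataset.
    target_df['sample'] = target_name
    out = {}
    for split in source_splits:
        extra = target_df[target_df['split'] == split]
        combined = pd.concat([source_splits[split], extra])
        # frac=1 returns every row in shuffled order; the seed fixes that order.
        out[split] = combined.sample(frac=1, random_state=SEED)
    return out

# Made-up source splits and extra data.
toy_source = {
    'train': pd.DataFrame({'comment': ['a', 'b'],
                           'split': ['train', 'train'],
                           'sample': ['random', 'blocked']}),
    'test': pd.DataFrame({'comment': ['c'],
                          'split': ['test'],
                          'sample': ['random']}),
}
toy_target = pd.DataFrame({'comment': ['x', 'y', 'z'],
                           'split': ['train', 'train', 'test']})

toy_aug = augment_in_memory(toy_source, toy_target, 'debias')
print({name: df.shape for name, df in toy_aug.items()})
```

Each split grows by the number of extra rows assigned to it and the column count is unchanged, which matches the shapes reported in this notebook: the train split goes from 95692 to 99157 rows (3465 added) while staying at 9 columns.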

In [11]:
wiki_splits['train'].shape

(95692, 9)

In [12]:
debias_splits['train'].shape

(99157, 9)

### Prep random data

In [13]:
random_splits = augment_with_data(wiki_splits, os.path.join(DATA_DIR, 'toxicity_debiasing_data_random.tsv'), 'debias_random')

In [14]:
random_splits['train'].shape

(99157, 9)

In [21]:
random_splits['train'].query('sample == "random"')

Unnamed: 0,comment,is_toxic,logged_in,ns,rev_id,sample,split,toxicity,year
25577,"` == The Halo's RfA == {| |- | |valign=top|Mark Dingemanse...Thank you very much for the constructive criticism in your oppose comment in my request for adminship. Ultimately, no consensus was reached, and I failed to be promoted, but I am very grateful for your coments. I will strive to better myself in all areas, especially Mainspace. |} `",False,True,user,7.7269e+07,random,train,0.0,2006.0
150589,: You've changed your point several times. Sorry if I missed it.,False,True,article,6.42874e+08,random,train,0.0,2015.0
144969,"` *Neither admins nor non-admins should count votes, as you and I both already know. No actual case for either having primary topic was made. `",False,True,user,6.1052e+08,random,train,0.0,2014.0
41923,"It's interesting that this IP feels they know me so well, when I have had limited interaction with you. We have never actively engaged each other on talk pages. It's amazing you somehow have been able to edit every page I edit; every photograph you crop is mine. Have you wondered where this ire against me and my work has come from? You continually put a photo of a man in a wig (a photoshopped wig, at that) on the Afro page, and four different editors have removed it, five if you include me. This is what you consider ownership and edit warring? Then you have taken my photographs and decided to rename them simply to remove David Shankbone out of the file name. Your intentions are no pure. You are gaming the policies and guidelines to Wikistalk me. You say not to bite the newcomers, but I haven't actively engaged you. If you are a newcomer, how come you have such a handy knowledge of Wiki policies and guidelines? Why is every page in your history one I have contributed to...",False,True,user,1.33966e+08,random,train,0.0,2007.0
134484,:::::I guess it comes down to substantial and my experience with the Danish version of it being being a bit more loosely defined (such as there is no actual demand of the ball being a tennis ball or who throws the ball).,False,True,article,5.49524e+08,random,train,0.0,2013.0
59535,"` Please stop. If you continue to vandalize Wikipedia, you will be blocked from editing. `",False,True,user,1.95916e+08,random,train,0.2,2008.0
39227,` :I removed them. `,False,True,article,1.23612e+08,random,train,0.0,2007.0
104536,"` == Stray punctuation == FYI, I believe this is now fixed per your query at . Thanks! `",False,True,user,3.81579e+08,random,train,0.0,2010.0
83332,REDIRECT Talk:List of diplomatic missions of Switzerland,False,True,article,2.8586e+08,random,train,0.0,2009.0
3429,"More Dutch speakers == I would say about 23 million people speak Dutch rahter than 20 million: * Netherlands 16 million * Belgium 6 million * Suriname, Antilles, other communities 1 million * total 23 million [anonymous] I don't know where you get those numbers, where comes the 1 million number? population: * Antilles 212,226 * Aruba 103,000 * Suriname 438,144 the population of these three regions doesnt reach one million, and Dutch is definetly not a national language in these places. Plus, the Netherlands and Belgium have a lot of immigrants. So I belive more in the 20 million figure, rather than the 23 million one.- 5 July 2005 11:27 (UTC) :There are for example probably about 12 million native speakers in the Netherlands, once you sutract immigrants, Frisian, Limburgish, and other Germanic language speakers. Suriname definitely isn't natively Dutch speaking; some of the languages ...",False,True,article,1.82048e+07,random,train,0.0,2005.0
