## Setting Up

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os

In [4]:
DATA_DIR = './'
SEED = 12

## Clean and Prepare Wikipedia Talk Labels: Toxicity Data

In [5]:
import pandas as pd
pd.set_option('display.max_colwidth', 1000)

In [6]:
toxicity_annotated_comments = pd.read_csv(os.path.join(DATA_DIR, 'toxicity_annotated_comments.tsv'), sep = '\t')
toxicity_annotations = pd.read_csv(os.path.join(DATA_DIR, 'toxicity_annotations.tsv'), sep = '\t')

In [7]:
# Average the per-annotator toxicity labels for each revision, then attach the comment text.
annotations_gped = toxicity_annotations.groupby('rev_id', as_index=False).agg({'toxicity': 'mean'})
all_data = pd.merge(annotations_gped, toxicity_annotated_comments, on = 'rev_id')

In [8]:
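# Replace the dataset's placeholder tokens for newlines and tabs with spaces.
all_data['comment'] = all_data['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
all_data['comment'] = all_data['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))
# Label a comment toxic when its mean annotator toxicity exceeds 0.5.
all_data['is_toxic'] = all_data['toxicity'] > 0.5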
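
To make the labeling rule concrete, here is a minimal plain-Python sketch of the same logic: average the votes per `rev_id`, then apply the strict `> 0.5` threshold. It assumes each annotation is a 0/1 vote. The `toy_votes` data and all `toy_*` names are hypothetical and invented for illustration, not taken from the dataset. Note that an exact tie (mean of 0.5) is not labeled toxic.

```python
from collections import defaultdict

# Hypothetical (rev_id, vote) pairs; values are invented for illustration.
toy_votes = [(1, 1), (1, 1), (1, 0), (2, 0), (2, 1), (2, 0), (3, 1), (3, 0)]

# Collect the votes for each revision.
votes_by_rev = defaultdict(list)
for rev_id, vote in toy_votes:
    votes_by_rev[rev_id].append(vote)

# Mean vote per revision, then the same strict > 0.5 threshold as the notebook.
toy_is_toxic = {
    rev_id: (sum(votes) / len(votes)) > 0.5
    for rev_id, votes in votes_by_rev.items()
}
print(toy_is_toxic)
```

Revision 3 has one vote each way, so its mean is exactly 0.5 and it falls on the non-toxic side of the threshold.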

In [9]:
# split into train, test, dev
wiki_splits = {}
for split in ['train', 'test', 'dev']:
    wiki_splits[split] = all_data.query('split == @split').drop(columns=['rev_id', 'toxicity', 'logged_in', 'ns', 'sample', 'split', 'year'])

In [10]:
wiki_splits['train'].head()

index,comment,is_toxic
0,"This: :One can make an analogy in mathematical terms by envisioning the distribution of opinions in a population as a Gaussian curve. We would then say that the consensus would be a statement that represents the range of opinions within perhaps three standard deviations of the mean opinion. sounds arbitrary and ad hoc. Does it really belong in n encyclopedia article? I don't see that it adds anything useful. The paragraph that follows seems much more useful. Are there any political theorists out there who can clarify the issues? It seems to me that this is an issue that Locke, Rousseau, de Toqueville, and others must have debated... SR",False
1,"` :Clarification for you (and Zundark's right, i should have checked the Wikipedia bugs page first). This is a ``bug`` in the code that makes wikipedia work it just means that there is a line of code that may have an error as small as an extra space. It's analogous (in a VERY simplified way) to trying to make something bold in HTML and forgetting to put the at the end, so you'd see something like this: words in bold Instead of this: words in bold It's not like a virus, that is code somebody deliberately wrote in order to infect your computer and damage files, so it won't ``go around.`` JHK `",False
3,"`This is such a fun entry. Devotchka I once had a coworker from Korea and not only couldn't she tell the difference between USA-English and British English, she had trouble telling the difference between different European languages. (Kind of keeps things in perspective, eh?) -) :Not suprising. While I can easily tell the difference between French, German, Italian, Spanish, Dutch, etc., put me in a room with a Chinese, Japanese, Korean, Vietnamese and a Thai speaker and I probably couldn't tell the difference. (If I saw it written I'd probably have somewhat more luck though.) SJK Vietnamese has more syllable-final consonants than Japanese, I think you can tell them apart that way, maybe. Is this right? - Juuitchan Someone suggested: ``Heath Robinson`` and ``Rube Goldberg`` as a vocabulary difference. It's certainly an interesting parallel, but I don't think it really belongs here. They were both artists with their own style, and both are known on both sides of the pond alt...",False
6,"` I fixed the link; I also removed ``homeopathy`` as an exampleit's not anything like a legitimate protoscience, or even half-legit. It's total pseudoscientific nonsense, and not taken seriously as many protosciences are. I'm willing to tolerate a sympathetic and historical treatment of it on its own page, but pages about real science shouldn't be littered with frauds. `",False
7,"`If they are ``indisputable`` then why does the NOAA dispute it? Note that the NOAA is the same source used by advocates of the CFC-ozone-UV-cancer connection. Is the NOAA trustworthy only when it supports an environmentalist position, and discreditable otherwise? Your comment not mine betrays a political agenda, and not a scientifically informed view. Ed Poor`",False


### Write data to file

In [11]:
def write_wiki_to_txt(splits, prefix):
    """Write each split to DATA_DIR/<prefix>_<split>.txt, one 'label|||comment' line per row."""
    for split, df in splits.items():
        output_path = os.path.join(DATA_DIR, '%s_%s.txt' % (prefix, split))
        with open(output_path, 'w') as f:
            for _, row in df.iterrows():
                # Label is '1' for toxic comments and '0' otherwise.
                label = '1' if row['is_toxic'] else '0'
                f.write(label + '|||' + row['comment'] + '\n')

In [12]:
write_wiki_to_txt(wiki_splits, 'wiki')
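
Each line written above has the form `label|||comment`. A minimal sketch of reading such a line back follows; `parse_wiki_line` is a hypothetical helper, not part of the original notebook. It splits only on the first `|||`, on the assumption that a comment could itself contain that separator.

```python
def parse_wiki_line(line):
    """Split one 'label|||comment' line into (is_toxic, comment)."""
    # maxsplit=1 keeps any later '|||' inside the comment text intact.
    label, comment = line.rstrip('\n').split('|||', 1)
    return label == '1', comment


# Invented example line, not taken from the dataset.
example = parse_wiki_line('1|||some ||| odd comment\n')
print(example)
```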