Fight online abuse - Text Classification with RNNs 
=====

Deep Learning for text classification with Torchtext, PyTorch & FastAI

In [7]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

Introduction
----

This notebook shows how to:
1. Pre-process NLP datasets for training using the Torchtext library,
2. Create an RNN model (GRU, LSTM, standard RNN) using the PyTorch library, and
3. Train the model using the FastAI library.

The dataset used is from the [Kaggle Toxic Comment Classification Challenge competition](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge). 

This notebook uses the GPU by default. To run it on the CPU instead, set `USE_GPU` to `False`.

In [8]:
USE_GPU = True

## 1. Loading Dataset


### 1.1. Download Dataset

First, we download and extract the dataset from the [Kaggle competition page](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data).

In [25]:
import os

DATA_PATH = 'data/toxic-comments/'
os.makedirs(DATA_PATH, exist_ok=True)

In [28]:
ZIP_DATA_PATH = f'{DATA_PATH}zip_files/'

In [25]:
!mkdir -p {ZIP_DATA_PATH} && kaggle competitions download -c jigsaw-toxic-comment-classification-challenge -p {ZIP_DATA_PATH} 


downloading https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/download/sample_submission.csv.zip

sample_submission.csv.zip 100% |####################| Time: 0:00:00   3.8 MiB/s

downloading https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/download/test.csv.zip

test.csv.zip 100% |#################################| Time: 0:00:00  31.6 MiB/s

downloading https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/download/train.csv.zip

train.csv.zip 100% |################################| Time: 0:00:00  39.0 MiB/s

downloading https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/download/test_labels.csv.zip

test_labels.csv.zip 100% |##########################| Time: 0:00:00   5.0 MiB/s



In [29]:
!ls {ZIP_DATA_PATH}

sample_submission.csv.zip test_labels.csv.zip
test.csv.zip              train.csv.zip


In [30]:
RAW_DATA_PATH = f'{DATA_PATH}raw_data_files/'
os.makedirs(RAW_DATA_PATH, exist_ok=True)

In [13]:
import glob
import os
import zipfile

def unzip(path, targ_path):
    """Unzip every archive matching the glob pattern `path` into `targ_path`."""
    for file in glob.glob(path):
        # the context manager closes the archive even if extraction fails
        with zipfile.ZipFile(file, 'r') as zip_ref:
            zip_ref.extractall(targ_path)
        print('%s unzipped' % os.path.basename(file))

In [29]:
unzip(f'{ZIP_DATA_PATH}*.zip', f'{RAW_DATA_PATH}')

train.csv.zip unzipped
test_labels.csv.zip unzipped
sample_submission.csv.zip unzipped
test.csv.zip unzipped


### 1.2. Import Datasets

In [5]:
import pandas as pd
import numpy as np

In [31]:
def get_file_names(path, file_format=''):
    """Function to get filenames in data path"""
    file_list=[]
    path = path+'*'+file_format
    for file in glob.glob(path):
        file_list.append(file.split('/')[-1].split('.')[0])
    return file_list   

In [32]:
file_names = get_file_names(f'{RAW_DATA_PATH}', file_format='.csv')  
file_names

['test_labels', 'test', 'train', 'sample_submission']

In [33]:
tables = [pd.read_csv(f'{RAW_DATA_PATH}{fname}.csv', low_memory=False) 
          for fname in file_names]

## 2. Preprocessing Dataset


### 2.1. Looking at the data
The training data contains a row per comment, with an id, the text of the comment, and 6 different labels that we'll try to predict.



In [39]:
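tables_by_name = dict(zip(file_names, tables))  # glob order is arbitrary, so look tables up by name
sample_submission_df, raw_test_labels_df, raw_train_df, raw_test_df = (tables_by_name[n] for n in ('sample_submission', 'test_labels', 'train', 'test'))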

In [100]:
raw_train_df.head(2)

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |


In [93]:
label_column_list = raw_train_df.columns[2:]; label_column_list

Index(['toxic', 'severe_toxic', 'obscene', 'threat', 'insult',
       'identity_hate'],
      dtype='object')

Here is one randomly sampled comment for each category:

In [104]:
for i in label_column_list:
    print(i)
    display(raw_train_df[getattr(raw_train_df, i) == 1].sample(1).iloc[0]['comment_text'])

toxic


"Fuck you asshole! \n\nGo fuck yourself! Dirty fuckin' asshole! Fuckin' scum!"

severe_toxic


'jam your message up your arse wikipedia dickhead'

obscene


"Read the above numbskull, would one of you vegetables please get a move on and block this I.P. I ain't leaving until this address is blocked so hop to it please assholes.."

threat


'Die \n\nI HATE YOU PRICK YOU DINT DESERVE A PLACE HERE'

insult


'YOU HAVE NO RIGHT FOR Your RACIST block Syrthiss!!  Iguarantee that Jimbo will hear about this and you shalll be punished accordingly!  White Hoods like you DONT belong in WIKI!  Your CYBERLYNCHING DAYS R Over You Racist Jerk!'

identity_hate


'The Israelis are committing massacres in Gaza, but nobody listens. There is even no photos which exposes those who permitted these massacres.'

### 2.2. Cleaning Up Dataset

Next, we use a `clean` function to remove newline characters, strip leaky elements such as IP addresses and usernames, and replace some punctuation.

In [40]:
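import re


def clean(comment):
    # make sure we are working with a string (e.g. if a value was parsed as NaN)
    comment = str(comment)

    # torchtext cannot read the .csv files correctly if there are newline characters, so replace them with " "
    comment = re.sub(r"\n", " ", comment)
    
    # remove leaky elements like IP addresses (dots are escaped so they only match a literal ".")
    comment = re.sub(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}", " ", comment)
    
    # remove wiki links such as usernames, e.g. [[User:Name]] (non-greedy, so text between two links is kept)
    comment = re.sub(r"\[\[.*?\]\]", "", comment)
    
    # replace the remaining punctuation (including "!") with spaces
    comment = re.sub(r"[\*\"“”\n\\…\+\-\/\=\(\)‘•:\[\]\|’\!;]", " ", comment)
    
    # collapse repeated spaces, commas and question marks
    comment = re.sub(r"[ ]+", " ", comment)
    comment = re.sub(r"\,+", ",", comment)
    comment = re.sub(r"\?+", "?", comment)

    return comment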

In [41]:
raw_train_df['comment_text'] = raw_train_df.comment_text.apply(clean)
print('train cleaned ...')

raw_test_df['comment_text'] = raw_test_df.comment_text.apply(clean)
print('test cleaned ...')


train cleaned ...
test cleaned ...


In [43]:
raw_train_df.shape

(159571, 8)

### 2.3. Train/Val/Test Split

Next, we split the training data into fixed train and validation sets that Torchtext can load from disk.


In [109]:
PROCESSED_DATA_PATH = f'{DATA_PATH}processed_data/'
os.makedirs(PROCESSED_DATA_PATH, exist_ok=True)

In [45]:
from sklearn.model_selection import train_test_split 

In [47]:
# split the training data into a train and validation dataset
train_ds, valid_ds = train_test_split(raw_train_df, test_size=0.2, random_state=9)
len(train_ds), len(valid_ds)

(127656, 31915)
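
The two sizes above can be checked by hand. The sketch below assumes the validation share is rounded up to a whole number of rows, which matches the counts printed here; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_valid), rounding the validation share up to a whole row."""
    n_valid = math.ceil(n_rows * test_size)
    return n_rows - n_valid, n_valid

# 159571 rows in the cleaned training set, 20% held out for validation
print(split_sizes(159571, 0.2))  # (127656, 31915)
```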

In [48]:
# save the train and validation splits
# (PROCESSED_DATA_PATH already ends with "/", so no extra separator is needed)
train_ds.to_csv(f'{PROCESSED_DATA_PATH}train_ds.csv', index=False)
valid_ds.to_csv(f'{PROCESSED_DATA_PATH}valid_ds.csv', index=False)

# save the full cleaned datasets (train+valid and test) as well
raw_train_df.to_csv(f'{PROCESSED_DATA_PATH}full_train_ds.csv', index=False)
raw_test_df.to_csv(f'{PROCESSED_DATA_PATH}test_ds.csv', index=False)


In [110]:
!ls {PROCESSED_DATA_PATH}

full_train_ds.csv test_ds.csv       train_ds.csv      valid_ds.csv


### 2.4. Tokenizing

Next, we load the `spacy` tokenizer, which breaks sentences into lists of words.

In [6]:
from torchtext import data, vocab

In [8]:
tokenizer = data.get_tokenizer('spacy')

### 2.5. Declaring Fields

We would like the `comment_text` field to be converted to lowercase and tokenized with the `spacy` tokenizer. So we pass that to the `TEXT` Field.

Since our target `LABEL` fields are already converted into a binary encoding, all we need to do is to tell the Field class that the labels are already processed. We do this by passing the `use_vocab=False` keyword to the constructor


In [9]:
TEXT = data.Field(sequential=True, tokenize=tokenizer, lower=True)
LABEL = data.Field(sequential=False, use_vocab=False)
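
To make the effect of the `TEXT` Field concrete, here is a minimal sketch of "tokenize, then lowercase". The regex tokenizer `simple_tokenize` is a hypothetical stand-in for spaCy (which splits text differently in many cases), so this only illustrates the idea, not the exact tokens Torchtext will produce.

```python
import re

def simple_tokenize(text):
    """Hypothetical stand-in for the spaCy tokenizer: words and punctuation marks become tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def preprocess_text(text):
    """Tokenize first, then lowercase each token (the effect of lower=True)."""
    return [tok.lower() for tok in simple_tokenize(text)]

print(preprocess_text("He matches this background colour, D'aww!"))
# ['he', 'matches', 'this', 'background', 'colour', ',', 'd', "'", 'aww', '!']
```

The label columns need no such step: they are already 0/1 integers, which is why `LABEL` is declared with `use_vocab=False`.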

### 2.6. Construct the Dataset
The `TEXT` and `LABEL` Fields now know what to do when given raw data. Next, we need to tell the Fields what data they should work on. This is where we use the `TabularDataset` class in Torchtext.

There are various built-in Datasets in torchtext that handle common data formats. For csv/tsv files, the `TabularDataset` class is convenient. Here’s how we would read data from a csv file:


In [11]:
trn_data_fields = [("id", None),
                   ("comment_text", TEXT), ('toxic', LABEL),
                  ('severe_toxic', LABEL), ('obscene', LABEL),
                  ('threat', LABEL), ('insult', LABEL),
                  ('identity_hate', LABEL),]

trn, vld = data.TabularDataset.splits(path=f'{PROCESSED_DATA_PATH}',
                                     train='train_ds.csv', validation='valid_ds.csv',
                                     format='csv', skip_header=True, fields=trn_data_fields)

full_trn = data.TabularDataset(path=f'{PROCESSED_DATA_PATH}full_train_ds.csv', 
                                     format='csv', skip_header=True, fields=trn_data_fields)


In [12]:
test_data_fields = [("id", None),
                   ("comment_text", TEXT)]

tst = data.TabularDataset(path=f'{PROCESSED_DATA_PATH}test_ds.csv',
                                     format='csv', skip_header=True, fields=test_data_fields)


### 2.7. Load pretrained vectors & Build vocabulary

In [13]:
VECTOR_PATH = '.vector_cache'
!ls {VECTOR_PATH}

glove.6B.100d.txt     glove.6B.200d.txt.pt  glove.6B.50d.txt.pt
glove.6B.100d.txt.pt  glove.6B.300d.txt     wiki.en.vec
glove.6B.200d.txt     glove.6B.50d.txt


In [14]:
pretrained_vectors = 'glove.6B.50d'  # 50-dimensional GloVe vectors
MAX_CHARS = 20000  # maximum vocabulary size (counted in tokens, despite the name)

For the `TEXT` field to convert words into integers, it needs to be told what the entire vocabulary is. To do this, we run `TEXT.build_vocab` on the full training dataset.


In [15]:
TEXT.build_vocab(full_trn, vectors=pretrained_vectors, max_size = MAX_CHARS)

Let's take a look at what the vocab looks like.

In [16]:
TEXT.vocab.freqs.most_common(10)

[('.', 514412),
 ('the', 495846),
 (',', 470871),
 ('"', 379291),
 ('to', 297111),
 ('i', 239115),
 ('of', 224181),
 ('and', 222987),
 ('you', 219888),
 ('a', 214419)]
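
As a rough illustration of what building a vocabulary involves, the sketch below counts tokens, keeps the `max_size` most frequent ones, and places two special tokens in front. The special tokens `<unk>` and `<pad>` are assumed to be the Torchtext defaults, which would explain a vocabulary of 20000 + 2 = 20002 entries; the helpers `build_vocab` and `numericalize` are hypothetical and simplified (tie-breaking between equally frequent tokens may differ from Torchtext).

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size, specials=("<unk>", "<pad>")):
    """Map the max_size most frequent tokens to integer ids, placed after the special tokens."""
    freqs = Counter(tok for text in tokenized_texts for tok in text)
    itos = list(specials) + [tok for tok, _ in freqs.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Convert tokens to ids; tokens outside the vocabulary fall back to the <unk> id."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

texts = [["you", "are", "great"], ["you", "are", "you"], ["thanks"]]
itos, stoi = build_vocab(texts, max_size=2)
print(itos)                                          # ['<unk>', '<pad>', 'you', 'are']
print(numericalize(["you", "are", "thanks"], stoi))  # [2, 3, 0]
```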

### 2.8. Creating the data iterator, batch wrapper & FastAI ModelData
During training, we'll be using the `BucketIterator`.
When we pass data into a neural network, the sequences in a batch must be padded to the same length so that they can be processed together:

e.g. [ [3, 15, 2, 7], [4, 1], [5, 5, 6, 8, 1] ] -> [ [3, 15, 2, 7, 0], [4, 1, 0, 0, 0], [5, 5, 6, 8, 1] ]

The `BucketIterator` groups sequences of similar lengths together for each batch to minimize padding. 
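
The padding and bucketing ideas can be sketched in a few lines of plain Python. This is a simplified illustration with hypothetical helper names; the real `BucketIterator` also shuffles and works on larger chunks of data, which is not modelled here.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence with pad_id to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

def make_batches(sequences, batch_size, bucket=False):
    """Cut sequences into batches, optionally sorting by length first (bucketing)."""
    ordered = sorted(sequences, key=len) if bucket else list(sequences)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]

def padding_needed(batches):
    """Total number of pad tokens required to make each batch rectangular."""
    return sum(max(map(len, batch)) - len(seq) for batch in batches for seq in batch)

seqs = [[3, 15, 2, 7], [4, 1], [5, 5, 6, 8, 1], [9]]
print(pad_batch(seqs[:3]))
# [[3, 15, 2, 7, 0], [4, 1, 0, 0, 0], [5, 5, 6, 8, 1]]
print(padding_needed(make_batches(seqs, 2)))               # 6 pad tokens without bucketing
print(padding_needed(make_batches(seqs, 2, bucket=True)))  # 2 pad tokens with bucketing
```

Grouping sequences of similar length means fewer positions are wasted on padding, which is the motivation for bucketing.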

In [18]:
batch_size=64

In [19]:
train_iter, val_iter = data.BucketIterator.splits(
                        (trn, vld), batch_size=batch_size,
                        device=(0 if USE_GPU else -1), 
                        sort_key=lambda x: len(x.comment_text), # the BucketIterator needs to be told what function it should use to group the data.
                        sort_within_batch=False, repeat=False)


For the test set, we don't want the data to be shuffled. Hence we'll be using the standard `Iterator`.

In [20]:
test_iter = data.Iterator(tst, batch_size=batch_size, device=(0 if USE_GPU else -1),
                          sort=False, sort_within_batch=False, 
                          repeat=False,train=False, shuffle=False)


Currently, the iterator returns a custom datatype called `torchtext.data.Batch`. This makes Torchtext hard to use with other deep learning libraries like FastAI.

Hence, we'll use the `BatchWrapper` class to return a tuple of the form (x, y), where x is the independent variable (the input to the model) and y is the dependent variable (the supervision data).



In [21]:
class BatchWrapper():
    def __init__(self, dataset, x_var, y_var):
        self.dataset, self.x_var, self.y_var = dataset, x_var, y_var
        
    def __iter__(self):
        for batch in self.dataset:
            x = getattr(batch, self.x_var) # we assume only one input in this wrapper
            
            if self.y_var is not None: # we will concatenate y into a single tensor
                y = torch.cat([getattr(batch, feat).unsqueeze(1) for feat in self.y_var],
                             dim=1).float()
            else:
                y = V(torch.zeros((1)))
                
                
            yield (x, y)
            
    def __len__(self):
        return len(self.dataset)

In [17]:
label_column_list = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']; label_column_list

['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [22]:
train_dl = BatchWrapper(train_iter, "comment_text", label_column_list)

val_dl = BatchWrapper(val_iter, "comment_text", label_column_list)

test_dl = BatchWrapper(test_iter, "comment_text", None)


In [7]:
from fastai.nlp import *

In [23]:
model_data = ModelData(PROCESSED_DATA_PATH, trn_dl=train_dl, val_dl=val_dl, test_dl=test_dl)

In [24]:
len(model_data.trn_dl), len(model_data.val_dl), len(model_data.test_dl), len(TEXT.vocab)

(1995, 499, 2394, 20002)
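
The first three numbers are the batch counts of each iterator. As a quick sanity check, assuming the last partial batch is kept rather than dropped, they follow from the dataset sizes and the batch size of 64 (`num_batches` is a hypothetical helper):

```python
import math

def num_batches(n_examples, batch_size):
    """Number of batches when the final, smaller batch is kept."""
    return math.ceil(n_examples / batch_size)

for name, n in [("train", 127656), ("valid", 31915), ("test", 153164)]:
    print(name, num_batches(n, 64))
# train 1995
# valid 499
# test 2394
```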

In [42]:
t, z =next(model_data.trn_dl.__iter__())

In [43]:
t.size(), z.size()

(torch.Size([967, 64]), torch.Size([64, 6]))

In [44]:
t, z =next(model_data.test_dl.__iter__())



In [45]:
t.size(), z.size()

(torch.Size([327, 64]), torch.Size([1]))

## 3. Training the Model


### 3.1. Model

In [18]:
import torch
import torch.nn.functional as F
import torch.nn as nn

In [20]:
class RNNModel(nn.Module):
    """
    Neural network module with an embedding layer, a recurrent module and a linear output layer.
    
    Arguments:
        rnn_type(str) -- type of RNN module to use; options are ['LSTM', 'GRU', 'RNN_TANH', 'RNN_RELU']
        input_size(int) -- size of the dictionary of embeddings (the vocabulary size)
        embz_size(int) -- the size of each embedding vector
        hidden_size(int) -- the number of features in the hidden state 
        batch_size(int) -- the size of training batches 
        output_size(int) -- the number of output classes to be predicted
        num_layers(int, optional) -- number of recurrent layers. Default=1 
        dropout(float, optional) -- dropout probability. Default=0.5
        bidirectional(bool, optional) -- if True, becomes a bidirectional RNN. Default=True
        tie_weights(bool, optional) -- if True, ties the weights of the embedding and output layer. Default=False
    
    Inputs: input
        input of shape (seq_length, batch_size) -- tensor containing the token indices of the input sequences
    
    Returns: output
        output of shape (batch_size, output_size) -- tensor containing the sigmoid activation on the 
                                                    output features h_t from the last layer of the rnn, 
                                                    for the last time-step t.
    """
    
    
    def __init__(self, rnn_type, input_size, embz_size, hidden_size, batch_size, output_size,
                num_layers=1, dropout=0.5, bidirectional=True, tie_weights=False):
        super().__init__()
        
        if bidirectional: self.num_directions = 2
        else: self.num_directions = 1

        self.hidden_size, self.output_size, self.embz_size = hidden_size, output_size, embz_size
        self.bidirectional, self.rnn_type, self.num_layers = bidirectional, rnn_type, num_layers
        self.drop = nn.Dropout(dropout)
        self.embedding_layer = nn.Embedding(input_size, embz_size)
        self.output_layer = nn.Linear(hidden_size*self.num_directions, output_size)
        self.init_hidden(batch_size)
        
        if rnn_type in ['LSTM', 'GRU']:
            self.rnn = getattr(nn, rnn_type)(embz_size, hidden_size, num_layers=num_layers, 
                           dropout=dropout, bidirectional=bidirectional)
        else:
            try:
                nonlinearity = {'RNN_TANH':'tanh', 'RNN_RELU':'relu'}[rnn_type]
            except KeyError:
                raise ValueError("""An invalid option for '--rnn_type' was supplied,
                                    options are ['LSTM', 'GRU', 'RNN_TANH', 'RNN_RELU']""")
            self.rnn = nn.RNN(embz_size, hidden_size, num_layers=num_layers, 
                           dropout=dropout, bidirectional=bidirectional, nonlinearity=nonlinearity)    
            
        if tie_weights:
            if hidden_size != embz_size:
                raise ValueError("When using the tied flag, hidden size must be equal to embedding size")
            elif bidirectional:
                raise ValueError("When using the tied flag, set bidirectional=False")
            self.output_layer.weight = self.embedding_layer.weight    
                    
    def init_emb_weights(self, vector_weight_matrix):
        self.embedding_layer.weight.data.copy_(vector_weight_matrix)
        
    def init_identity_weights(self):
        if self.rnn_type == 'RNN_RELU':
            self.rnn.weight_ih_l0.data.copy_(torch.eye(self.hidden_size, self.embz_size))
            self.rnn.weight_hh_l0.data.copy_(torch.eye(self.hidden_size, self.hidden_size))

            if self.bidirectional:
                self.rnn.weight_ih_l0_reverse.data.copy_(torch.eye(self.hidden_size, self.embz_size))
                self.rnn.weight_hh_l0_reverse.data.copy_(torch.eye(self.hidden_size, self.hidden_size))  
        else:
            pass
    
    def init_hidden(self, batch_size):
        if self.rnn_type == 'LSTM':
            self.hidden = (V(torch.zeros(self.num_layers*self.num_directions, batch_size, self.hidden_size)),
                           V(torch.zeros(self.num_layers*self.num_directions, batch_size, self.hidden_size)))
        else:
            self.hidden = V(torch.zeros(self.num_layers*self.num_directions, batch_size, self.hidden_size))
                
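    def forward(self, seq):
        # seq has shape (seq_length, batch_size)
        batch_size = seq.size(1)
        # for an LSTM self.hidden is a (h, c) tuple, otherwise a single tensor;
        # re-initialise it when the batch size changes (e.g. for the last, smaller batch)
        h = self.hidden[0] if self.rnn_type == 'LSTM' else self.hidden
        if h.size(1) != batch_size:
            self.init_hidden(batch_size)
        input_tensor = self.drop(self.embedding_layer(seq))
        output, hidden = self.rnn(input_tensor, self.hidden)
        # detach the hidden state so gradients do not flow back into earlier batches
        self.hidden = repackage_var(hidden)
        output = self.drop(self.output_layer(output))
        # sigmoid over the last time-step gives one probability per label
        return torch.sigmoid(output[-1, :, :])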

In [28]:
vector_weight_matrix = TEXT.vocab.vectors     # pretrained GloVe vectors, one row per vocabulary entry
input_size = vector_weight_matrix.size(0)     # vocabulary size
hidden_size = vector_weight_matrix.size(1)    # hidden size is set equal to the embedding size
output_size = 6                               # one output per label
embz_size = vector_weight_matrix.size(1)      # embedding dimension
rnn_type = 'GRU'
model = RNNModel(rnn_type, input_size, embz_size, hidden_size, batch_size, output_size); model

RNNModel(
  (drop): Dropout(p=0.5)
  (embedding_layer): Embedding(20002, 50)
  (output_layer): Linear(in_features=100, out_features=6)
  (rnn): GRU(50, 50, dropout=0.5, bidirectional=True)
)
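
The `in_features=100` of the output layer follows from the bidirectional setting: the forward and backward hidden states (50 features each) are concatenated. The sketch below tracks the tensor shapes through an embedding, recurrent and linear stack like the one above. `rnn_shapes` is a hypothetical helper that only does the bookkeeping, and the sequence length of 967 is taken from the sample training batch shown earlier.

```python
def rnn_shapes(seq_len, batch_size, embz_size, hidden_size, output_size,
               num_layers=1, bidirectional=True):
    """Return the tensor shapes at each stage of an embedding -> RNN -> linear model."""
    num_directions = 2 if bidirectional else 1
    return {
        "token ids": (seq_len, batch_size),
        "embedded": (seq_len, batch_size, embz_size),
        "rnn output": (seq_len, batch_size, hidden_size * num_directions),
        "hidden state": (num_layers * num_directions, batch_size, hidden_size),
        "linear in_features": hidden_size * num_directions,
        # only the last time-step of the linear output is kept
        "prediction": (batch_size, output_size),
    }

for name, shape in rnn_shapes(967, 64, 50, 50, 6).items():
    print(f"{name:>20}: {shape}")
```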

Next, we load the pretrained vectors (glove.6B.50d) into the embedding layer.

In [29]:
model.init_emb_weights(vector_weight_matrix)
model.embedding_layer.weight

Parameter containing:
 0.0000  0.0000  0.0000  ...   0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  ...   0.0000  0.0000  0.0000
 0.1516  0.3018 -0.1676  ...  -0.3565  0.0164  0.1022
          ...             ⋱             ...          
 0.4058  0.2113  0.0083  ...   0.5022 -0.7418 -0.3712
 1.0987 -0.2232  0.3466  ...  -0.3047  1.0554 -0.2385
-0.1730 -0.2120  0.6912  ...   0.8054 -0.3160 -0.6211
[torch.cuda.FloatTensor of size 20002x50 (GPU 0)]

In [None]:
if USE_GPU:
    model.cuda()

### 3.2. Training

In [30]:
MODEL_PATH = f'{DATA_PATH}models/{rnn_type}/'

In [31]:
os.makedirs(f'{MODEL_PATH}', exist_ok=True)

In [32]:
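# Adam with a learning rate of 1e-2 and a weight decay of 1e-5
lo = LayerOptimizer(optim.Adam, model, 1e-2, 1e-5)
# binary cross-entropy, since each of the 6 labels is an independent yes/no prediction
loss_func = nn.BCELoss() 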
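
For reference, binary cross-entropy averages the per-element loss -[y·log(p) + (1-y)·log(1-p)] over every label of every example. The pure-Python sketch below computes this by hand for one comment with six labels; the probabilities are made-up illustrative values, and `binary_cross_entropy` is a hypothetical helper rather than the PyTorch implementation.

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all elements."""
    losses = [-(y * math.log(p) + (1 - y) * math.log(1 - p))
              for p, y in zip(probs, targets)]
    return sum(losses) / len(losses)

# illustrative predictions for the six labels of a single comment
probs = [0.9, 0.1, 0.8, 0.2, 0.7, 0.05]
targets = [1, 0, 1, 0, 1, 0]
print(round(binary_cross_entropy(probs, targets), 4))
```

Confident wrong predictions are penalised heavily: predicting 0.01 for a label that is actually 1 costs about 4.6, while predicting 0.99 costs about 0.01.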

In [33]:
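# save a checkpoint at the end of every cosine-annealing cycle
on_end = lambda sched, cycle: save_model(model, f'{MODEL_PATH}cyc_{cycle}')
# cosine annealing with restarts; each cycle is twice as long as the previous one
cb = [CosAnneal(lo, len(model_data.trn_dl), cycle_mult=2, on_cycle_end=on_end)]
# train for 2**4 = 16 epochs
fit(model, model_data, 2**4, lo.opt, loss_func, callbacks=cb)
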
epoch      trn_loss   val_loss   
    0      0.371394   0.069593  
    1      0.371476   0.062126                                 
    2      0.368127   0.060966                                 
    3      0.374017   0.07338                                  
    4      0.373226   0.062801                                 
    5      0.367177   0.059734                                 
    6      0.364643   0.055366                                 
    7      0.37301    0.069894                                 
    8      0.372339   0.068181                                 
    9      0.373894   0.059369                                 
    10     0.374497   0.063353                                 
    11     0.370822   0.063703                                 
    12     0.37109    0.060205                                 
    13     0.366658   0.056104                                 
    14     0.367601   0.053094                                 
    15     0.372098   0.061729       

[array([ 0.06173])]
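
The schedule behind these numbers can be sketched as follows. This assumes the textbook form of cosine annealing with warm restarts, with a first cycle of one epoch that doubles in length each time (`cycle_mult=2`); FastAI's `CosAnneal` may differ in details such as a short warm-up, and both helpers below are hypothetical.

```python
import math

def cycle_starts(n_epochs, cycle_len=1, cycle_mult=2):
    """Epoch indices at which a warm restart resets the learning rate to its maximum."""
    starts, epoch = [], 0
    while epoch < n_epochs:
        starts.append(epoch)
        epoch += cycle_len
        cycle_len *= cycle_mult
    return starts

def cosine_lr(lr_max, step, steps_in_cycle):
    """Cosine-annealed learning rate at a given step within one cycle."""
    return lr_max / 2 * (math.cos(math.pi * step / steps_in_cycle) + 1)

print(cycle_starts(16))            # [0, 1, 3, 7, 15]
print(cosine_lr(1e-2, 0, 1995))    # learning rate at the start of a cycle: 0.01
```

Under this assumption the restarts fall on epochs 1, 3, 7 and 15, which is consistent with the validation loss rising at epochs 3, 7 and 15 in the log above, right after the learning rate jumps back up.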

## 4. Making Predictions

### 4.1. Generate predictions

In [34]:
test_pred = []

for x, y in tqdm(model_data.test_dl):
    preds = model(VV(x))
    preds = preds.data.cpu().numpy()
    test_pred.append(preds)

        

100%|██████████| 2394/2394 [01:21<00:00, 29.32it/s]


  


In [35]:
test_preds = np.concatenate(test_pred)

In [36]:
test_preds.shape

(153164, 6)

### 4.2. Make submission on Kaggle

In [52]:
test_ds = pd.read_csv(f'{PROCESSED_DATA_PATH}test_ds.csv', low_memory=False); test_ds.shape

(153164, 2)

In [53]:
for i , col in enumerate(label_column_list):
    test_ds[col] = test_preds[:, i]


In [54]:
test_ds.head()

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 00001cee341fdb12 | Yo bitch Ja Rule is more succesful then you wi... | 0.68683 | 0.123798 | 0.679709 | 0.154812 | 0.458549 | 0.362866 |
| 1 | 0000247867823ef7 | == From RfC == The title is fine as it is, IMO. | 0.005412 | 0.000691 | 0.003433 | 0.002556 | 0.003729 | 0.006844 |
| 2 | 00013b17ad220c46 | " == Sources == * Zawe Ashton on Lapland — / " | 0.005122 | 0.000674 | 0.003221 | 0.00263 | 0.003585 | 0.007108 |
| 3 | 00017563c3f7919a | :If you have a look back at the source, the in... | 0.00371 | 0.00058 | 0.002605 | 0.002121 | 0.002988 | 0.005025 |
| 4 | 00017695ad8997eb | I do not anonymously edit articles at all. | 0.005053 | 0.000671 | 0.003246 | 0.002612 | 0.003537 | 0.006941 |


In [55]:
SUBMISSION_PATH = f'{DATA_PATH}submission/'

In [56]:
os.makedirs(SUBMISSION_PATH, exist_ok=True)

In [57]:
#writes the submission file to disk
test_ds.drop("comment_text", axis=1).to_csv(f'{SUBMISSION_PATH}submission.csv', index=False)

In [64]:
#submits predictions to kaggle and returns public score
!kg submit {SUBMISSION_PATH}submission.csv

0.9517
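
To interpret this score: the sketch below assumes the competition metric is the mean column-wise ROC AUC (the notebook itself does not state the metric). ROC AUC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so it depends only on how predictions are ranked, not on any threshold. Both helpers are hypothetical and use a simple pairwise comparison that is only practical for small inputs.

```python
def roc_auc(scores, labels):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_columnwise_auc(score_columns, label_columns):
    """Average the per-label AUC over all label columns."""
    aucs = [roc_auc(s, y) for s, y in zip(score_columns, label_columns)]
    return sum(aucs) / len(aucs)

# toy example: 3 of the 4 positive/negative pairs are ranked correctly
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```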

### 4.3. Looking at example predictions

In [104]:
def to_label(x, threshold=0.5):
    if x > threshold:
        return 1
    else:
        return 0

In [107]:
for i in range(len(label_column_list)):
    test_ds[label_column_list[i]] = test_ds[label_column_list[i]].apply(to_label, threshold=0.45)

In [108]:
test_ds.head()

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 00001cee341fdb12 | Yo bitch Ja Rule is more succesful then you wi... | 1 | 0 | 1 | 0 | 1 | 0 |
| 1 | 0000247867823ef7 | == From RfC == The title is fine as it is, IMO. | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 00013b17ad220c46 | " == Sources == * Zawe Ashton on Lapland — / " | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 00017563c3f7919a | :If you have a look back at the source, the in... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 00017695ad8997eb | I do not anonymously edit articles at all. | 0 | 0 | 0 | 0 | 0 | 0 |


In [109]:
for i in label_column_list:
    print(i.upper())
    # sample one test comment that was predicted positive for this label
    display(test_ds[test_ds[i] == 1].sample().iloc[0]['comment_text'])


TOXIC


'schumaker is a dick!'

SEVERE_TOXIC


'benon!! ==what benon does at night with fucking curps there both nigger shitty asholes who fuck eachother==benon!! ==what benon does at night with fucking curps there both nigger shitty asholes who fuck eachother==benon!! ==what benon does at night with fucking curps there both nigger shitty asholes who fuck eachother==benon!! ==what benon does at night with fucking curps there both nigger shitty asholes who fuck eachother==benon!! ==what benon does at night with fucking curps there both nigger shitty asholes who fuck eachother==benon!! ==what benon does at night with fucking curps there both nigger shitty asholes who fuck eachother==benon!! ==what benon does at night with fucking curps there both nigger shitty asholes who fuck eachother==benon!! ==what benon does at night with fucking curps there both nigger shitty asholes who fuck eachother==benon!! ==what benon does at night with fucking curps there both nigger shitty asholes who fuck eachother==benon!! ==what benon does at night w

OBSCENE


'WHO THE FUCK CARES, SHIT LIKE THIS IS WHY NOBODY TAKES WIKIPEDIA SERIOUSLY'

THREAT


'WIKIPEDIA IS INHABITED BY FAT CUNTS WHO FUCKING WASTE THEIR LIVES BEING FAT LOSERS FUCK YOU ALL DIE OF FACE CANCER AND DIE POOR. WIKIPEDIA IS INHABITED BY FAT CUNTS WHO FUCKING WASTE THEIR LIVES BEING FAT LOSERS FUCK YOU ALL DIE OF FACE CANCER AND DIE POOR. WIKIPEDIA IS INHABITED BY FAT CUNTS WHO FUCKING WASTE THEIR LIVES BEING FAT LOSERS FUCK YOU ALL DIE OF FACE CANCER AND DIE POOR. WIKIPEDIA IS INHABITED BY FAT CUNTS WHO FUCKING WASTE THEIR LIVES BEING FAT LOSERS FUCK YOU ALL DIE OF FACE CANCER AND DIE POOR. WIKIPEDIA IS INHABITED BY FAT CUNTS WHO FUCKING WASTE THEIR LIVES BEING FAT LOSERS FUCK YOU ALL DIE OF FACE CANCER AND DIE POOR. WIKIPEDIA IS INHABITED BY FAT CUNTS WHO FUCKING WASTE THEIR LIVES BEING FAT LOSERS FUCK YOU ALL DIE OF FACE CANCER AND DIE POOR. WIKIPEDIA IS INHABITED BY FAT CUNTS WHO FUCKING WASTE THEIR LIVES BEING FAT LOSERS FUCK YOU ALL DIE OF FACE CANCER AND DIE POOR. WIKIPEDIA IS INHABITED BY FAT CUNTS WHO FUCKING WASTE THEIR LIVES BEING FAT LOSERS FUCK YOU ALL 

INSULT


'fuck you'

IDENTITY_HATE


"Like Rape Words And Bad Gay Jokes Meen Really like'thats what she say'He Say Weird Ass Noise Is Normal And Penis So PewDiePie Please Why Use Words LIKE THAT MAN D"

In [110]:
# sample one test comment that was predicted negative for every label
test_ds[(test_ds.toxic==0) & (test_ds.severe_toxic==0) & 
        (test_ds.obscene==0) & (test_ds.threat==0) &
        (test_ds.insult==0) & (test_ds.identity_hate==0)].sample().iloc[0]['comment_text']   


':Roll Grandiose in. Ultimately this is for fun and to encourage quality editing, and that goal is achieved with either 64 or 65 participants!'