# Toxic Preprocessing

This notebook tokenizes the `Toxic comments` datasets.

Tokenization uses multiprocessing and returns a list of ints and, optionally, a dictionary that maps those ints back to strings.
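
As a rough illustration of the optional reverse mapping, the sketch below inverts a word-to-int dictionary so that token ids can be decoded back into words. The toy vocabulary and the names `int2word` and `decode` are assumptions made for illustration; they are not part of `lm.preprocess`.

```python
# Toy vocabulary; the real dictionary is loaded from a pickle later in the notebook.
word2int = {"<unk>": 0, "this": 1, "is": 2, "a": 3, "comment": 4}

# Invert the mapping so that ints can be turned back into strings.
int2word = {index: word for word, index in word2int.items()}


def decode(token_ids):
    """Turn a list of ints back into a space-separated string."""
    return " ".join(int2word[i] for i in token_ids)


decoded = decode([1, 2, 3, 4])
print(decoded)
```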

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path
import numpy as np
import pickle

from lm.preprocess import ToxicTokenizer

## 1. Training

In [3]:
train_path = Path('toxic_data/train.csv')

Note that the tokenizer takes a list of paths; if the list contains several paths, the data they point to is concatenated.
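
To illustrate what concatenating several input files amounts to, here is a minimal sketch using only the standard library. The helper `load_rows`, the toy CSV files, and their column names are hypothetical; `ToxicTokenizer` may load its data differently.

```python
import csv
import tempfile
from pathlib import Path


def load_rows(paths):
    """Read every CSV in `paths` and return all rows as one list of dicts."""
    rows = []
    for path in paths:
        with Path(path).open(newline="", encoding="utf-8") as file:
            rows.extend(csv.DictReader(file))
    return rows


# Two toy files standing in for separate dataset files.
tmp_dir = Path(tempfile.mkdtemp())
first, second = tmp_dir / "a.csv", tmp_dir / "b.csv"
first.write_text("id,comment_text\n1,hello\n", encoding="utf-8")
second.write_text("id,comment_text\n2,world\n3,again\n", encoding="utf-8")

rows = load_rows([first, second])
print(len(rows))
```

Rows keep the order of the paths in the list, so the position of a comment in the combined data depends on the order in which the files are passed.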

In [4]:
dict_path = Path('word2int.pickle')
with dict_path.open('rb') as file:
    d = pickle.load(file)

In [5]:
train_tokenizer = ToxicTokenizer([train_path])

Loaded 159571 articles


`preprocess` accepts a dictionary as input. Here, the dictionary that was created from the wikitext dataset when training the language model is used.

Reusing that dictionary allows the language model to be finetuned for this task.

In [6]:
tokenized_train_ints = train_tokenizer.preprocess(word2int=d)

Processed 10000 comments
Processed 20000 comments
Processed 30000 comments
Processed 40000 comments
Processed 50000 comments
Processed 60000 comments
Processed 70000 comments
Processed 80000 comments
Processed 90000 comments
Processed 100000 comments
Processed 110000 comments
Processed 120000 comments
Processed 130000 comments
Processed 140000 comments
Processed 150000 comments
Tokenized articles!
Coverage is 91.46685216843248 %
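
The coverage figure printed above is presumably the share of tokens found in the supplied dictionary. A minimal sketch of that idea follows, assuming whitespace tokenization and a hypothetical `<unk>` entry for out-of-vocabulary words; the function `encode_with_coverage` is illustrative and the actual `ToxicTokenizer` logic may differ.

```python
def encode_with_coverage(comments, word2int, unk_token="<unk>"):
    """Map each comment to a list of ints and report dictionary coverage.

    Assumptions: comments are lowercased and split on whitespace, and words
    missing from `word2int` are mapped to the id of `unk_token`.
    """
    unk_id = word2int[unk_token]
    encoded, known, total = [], 0, 0
    for comment in comments:
        ids = []
        for word in comment.lower().split():
            total += 1
            if word in word2int:
                known += 1
            ids.append(word2int.get(word, unk_id))
        encoded.append(ids)
    # Coverage as a percentage of tokens that were found in the dictionary.
    coverage = 100 * known / total if total else 0.0
    return encoded, coverage


word2int = {"<unk>": 0, "you": 1, "are": 2, "wrong": 3}
encoded, coverage = encode_with_coverage(["You are wrong", "you are daft"], word2int)
print(encoded, coverage)
```

Under this reading, a coverage of about 91 % would mean that roughly one token in eleven is missing from the dictionary passed to `preprocess`.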


In [7]:
labels, header2index = train_tokenizer.get_labels()

## 2. Test / Val

In [8]:
test_comments = Path('toxic_data/test.csv')
test_labels = Path('toxic_data/test_labels.csv')

In [9]:
valtest_tokenizer = ToxicTokenizer([test_comments], label_filepaths=[test_labels])

Loaded 153164 articles


In [10]:
tokenized_test_ints = valtest_tokenizer.preprocess(word2int=d)

Processed 10000 comments
Processed 20000 comments
Processed 30000 comments
Processed 40000 comments
Processed 50000 comments
Processed 60000 comments
Processed 70000 comments
Processed 80000 comments
Processed 90000 comments
Processed 100000 comments
Processed 110000 comments
Processed 120000 comments
Processed 130000 comments
Processed 140000 comments
Processed 150000 comments
Tokenized articles!
Coverage is 90.07447419439735 %


In [17]:
test_labels, test_header2index = valtest_tokenizer.get_labels()

## 3. Save everything

In [18]:
np.save('toxic_train_int_tokens.npy', np.asarray(tokenized_train_ints))

In [19]:
np.save('toxic_test_int_tokens.npy', np.asarray(tokenized_test_ints))

In [21]:
assert test_header2index == header2index

In [22]:
dict_path = Path('toxic_header2index.pickle')
with dict_path.open(mode='wb') as file:
    pickle.dump(header2index, file, protocol=pickle.HIGHEST_PROTOCOL)

In [23]:
np.save('train_labels.npy', np.asarray(labels))

In [24]:
np.save('test_labels.npy', np.asarray(test_labels))
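
If the tokenized comments have different lengths, which is an assumption about the output of `preprocess`, NumPy 1.24 and later refuse to build an array from them unless `dtype=object` is given, and such arrays need `allow_pickle=True` when loaded. A small round-trip sketch with toy data and a temporary file:

```python
import tempfile
from pathlib import Path

import numpy as np

# Toy stand-in for tokenized comments of unequal length.
ragged_tokens = [[1, 2, 3], [4, 5], [6]]

# dtype=object keeps each comment as one Python list inside a 1-D array.
tokens_array = np.array(ragged_tokens, dtype=object)

save_path = Path(tempfile.mkdtemp()) / "toy_int_tokens.npy"
np.save(save_path, tokens_array)

# Object arrays are pickled inside the .npy file, so loading needs allow_pickle=True.
loaded = np.load(save_path, allow_pickle=True)
print(loaded.shape, loaded[1])
```

Only load pickled arrays from files you created yourself, since unpickling can execute arbitrary code.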