# Text Classification Preprocessing Example: Toxicity

This notebook contains preprocessing steps for setting up a text classification dataset for ENN.
The example problem is Wikipedia toxic comment classification, i.e., toxicity.
These steps serve as a reference for applying ENN to other text classification problems.

WARNING: Please be aware that the toxicity dataset contains many highly offensive text comments, which may be encountered when handling the dataset. The inclusion of such offensive text is intentional: part of the goal of the problem is to identify such inappropriate and unhelpful content.

Note: To follow this notebook interactively, Jupyter can be run locally or on a remote machine. General Jupyter documentation is available at https://jupyter-notebook.readthedocs.io/en/stable/, and sample instructions for running it remotely are at https://towardsdatascience.com/running-jupyter-notebooks-on-remote-servers-603fbcc256b3.

## Load raw data

This assumes the data is in a three-column TSV file, as in `toxicity-raw-data-sample.tsv`.
TSV is used instead of CSV because the comments may contain commas but not tabs.
The file has the following columns:
- 'label': the class label of the sample, i.e., 0 for non-toxic, 1 for toxic.
- 'text': the raw text of the Wikipedia comment.
- 'split': the data split of the sample, either 'train', 'dev' (i.e., for validation), or 'test'. 

In [10]:
import pandas as pd
raw_data_file = 'toxicity-raw-data-sample.tsv'
raw_df = pd.read_csv(raw_data_file, sep='\t')
raw_df[:5]

Unnamed: 0,label,text,split
0,0,s Kamen Rider. Ronhjones,train
1,0,"No way, Mercedez PWNZ!!",train
2,0,"MoP, Have you been reading anything I have w...",dev
3,0,"== Country profiles? == Hi, I saw your noti...",test
4,0,"Pete, I have discussed this change. You are ...",test
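
As a small illustration of why tabs are the safer delimiter, the sketch below parses a made-up two-row TSV string whose comments contain commas and checks the three expected columns. It uses the standard-library `csv` module as a stand-in for `pandas`; the rows and the names `toy_tsv`, `expected_columns`, and `allowed_splits` are invented for this example and are not part of the dataset.

```python
import csv
import io

# Made-up rows for illustration only; commas inside the text are harmless
# because fields are separated by tabs.
toy_tsv = (
    "label\ttext\tsplit\n"
    "0\tNo way, this has commas, see?\ttrain\n"
    "1\tAnother, made-up, comment\tdev\n"
)

rows = list(csv.DictReader(io.StringIO(toy_tsv), delimiter='\t'))

expected_columns = {'label', 'text', 'split'}
allowed_splits = {'train', 'dev', 'test'}
for row in rows:
    # Each row should have exactly the three documented columns
    assert set(row) == expected_columns
    # The split column should only hold one of the three known values
    assert row['split'] in allowed_splits

print(rows[0]['text'])
```

The same checks could be applied to `raw_df` after loading, for example by comparing `raw_df['split'].unique()` against the allowed split names.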


## Split data

This step separates the data into input (text) and output (labels), for each of the three specified splits.

In [14]:
splits = ['train', 'test', 'dev']
split_labels = {}
raw_split_text = {}
for split in splits:
    # Select the rows of this split once, then pull out each column
    in_split = raw_df['split'] == split
    split_labels[split] = raw_df.loc[in_split, 'label'].tolist()
    raw_split_text[split] = raw_df.loc[in_split, 'text'].tolist()
print("Labels:", split_labels['train'][:5])
print("Text:", raw_split_text['train'][:5])

Labels: [0, 0, 0, 0, 0]
Text: ['s Kamen Rider. Ronhjones', '  No way, Mercedez PWNZ!!', "  == Re: Coordinator Elections! ==  Well, while I've been a bit tardy these recent times, yes, I will probably stand again now that the pressures of life are easing off! I'm also going to put in some graft and finish the last chunks of my pet project, Military history of New Zealand. Good luck to you!  e ", '  == February 2006 ==  Thanks for experimenting with the page Arminianism on Wikipedia. Your test worked, and has been reverted or removed. Please use the sandbox for any other tests you want to do. Take a look at the welcome page if you would like to learn more about contributing to our encyclopedia.  Thanks.   ', '  what european power took over algeria?']


## Tokenize text
This step creates a mapping between words (or "tokens") and integers, and uses this mapping to encode each text sample as a sequence that can be fed into a Keras model.

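This step uses a parameter `vocab_size`, which bounds how many unique words are kept in the encoding. The most frequent words in the training split are kept, and all other words are dropped from the encoded sequences, i.e., they are deemed too rare to lead to useful generalization. Note that the Keras `Tokenizer` never assigns index 0, so `num_words=vocab_size` keeps the `vocab_size - 1` most frequent words.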
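
To make the frequency cutoff concrete, here is a minimal pure-Python sketch of the idea. It is a simplified imitation, not the Keras implementation: the real `Tokenizer` also strips punctuation by default and may order equally frequent words differently. The helper names `build_word_index` and `encode` and the toy sentences are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to a 1-based rank, most frequent word first."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending frequency; break ties alphabetically for determinism
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def encode(texts, word_index, vocab_size):
    """Encode texts as integer sequences, dropping words ranked >= vocab_size."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < vocab_size]
            for text in texts]

toy_texts = ["the cat sat", "the dog sat", "the bird flew"]
toy_index = build_word_index(toy_texts)
toy_encoded = encode(toy_texts, toy_index, vocab_size=3)
print(toy_index)
print(toy_encoded)
```

With `vocab_size=3`, only the two most frequent words ("the" and "sat") survive, so the rarer words simply vanish from the encoded sequences.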

In [40]:
from keras_preprocessing.text import Tokenizer
vocab_size = 1000
tokenizer = Tokenizer(num_words=vocab_size)
# Fit the vocabulary on the training split only
tokenizer.fit_on_texts(raw_split_text['train'])
tok_split_text = {}
for split in splits:
    tokenized = tokenizer.texts_to_sequences(raw_split_text[split])
    tok_split_text[split] = tokenized
print(tok_split_text['train'][:5])

[[235], [43, 145], [429, 117, 183, 139, 59, 7, 322, 85, 368, 236, 268, 6, 68, 369, 108, 86, 10, 1, 3, 430, 24, 161, 73, 54, 162, 2, 223, 5, 53, 4, 1, 293, 3, 33, 224, 269, 294, 431, 3, 118, 129, 836, 2, 8, 323], [432, 837, 87, 14, 25, 1, 35, 17, 40, 27, 838, 839, 4, 56, 59, 645, 28, 295, 67, 119, 1, 646, 14, 82, 65, 8, 124, 2, 32, 163, 7, 134, 39, 1, 515, 35, 34, 8, 48, 58, 2, 51, 42, 2, 164, 370, 87], [37, 840, 647, 225]]


## Save preprocessed data
Use pickle to save the preprocessed labels and token sequences to data files, which can then be used by ENN.

In [41]:
import pickle
with open('toxicity_labels_example.pkl', 'wb') as f:
    pickle.dump(split_labels, f)
with open('toxicity_tokens_example.pkl', 'wb') as f:
    pickle.dump(tok_split_text, f)
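
Before handing the files over, it can be worth confirming that a pickle round trip returns the same structure. The sketch below does this with a small made-up dictionary written to a temporary directory, so it does not touch the files above; `example_labels` and the file name are placeholders for illustration.

```python
import os
import pickle
import tempfile

# Made-up stand-in for split_labels: one list of labels per split
example_labels = {'train': [0, 1], 'dev': [0], 'test': [1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'labels_example.pkl')
    # Write the dictionary, then read it straight back
    with open(path, 'wb') as f:
        pickle.dump(example_labels, f)
    with open(path, 'rb') as f:
        reloaded_labels = pickle.load(f)

print(reloaded_labels == example_labels)
```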

## Optional: Preprocess pretrained embeddings

The toxicity domain includes the option of using pre-trained word embeddings.
Embeddings map each word to a fixed-size vector of real numbers that encodes prior knowledge of the word's meaning.
ENN may find such embeddings to be useful during the evolutionary process.
Such embeddings are trained on very large text corpora, so they are especially useful for establishing word meaning in applications where the problem itself has little data.
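
As a toy illustration of how vectors can encode meaning, the sketch below compares made-up 3-dimensional vectors with cosine similarity. The numbers are invented for this example and are not taken from any real embedding file; real embeddings have tens to hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented vectors: 'cat' and 'dog' point in similar directions, 'car' does not
toy_vectors = {
    'cat': [0.9, 0.1, 0.0],
    'dog': [0.8, 0.2, 0.1],
    'car': [0.0, 0.1, 0.9],
}

sim_cat_dog = cosine_similarity(toy_vectors['cat'], toy_vectors['dog'])
sim_cat_car = cosine_similarity(toy_vectors['cat'], toy_vectors['car'])
print(sim_cat_dog > sim_cat_car)
```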

Raw pre-trained embedding files can be very large (multiple gigabytes), since they may contain embeddings for millions of words.
So, it is useful to preprocess them once, so that ENN does not need to process them repeatedly on each worker.
Preprocessed embeddings can then be accessed in ENN as additional data files.
Common pre-trained embedding types include word2vec, fastText, and GloVe.
These are available for download at various locations, for example:
- https://nlp.stanford.edu/projects/glove/
- https://fasttext.cc/

The step for loading these raw files may need to be adjusted for embedding files in alternate formats.
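
One common variation is a text file that starts with a header line giving the word count and vector dimension, as fastText `.vec` files do, whereas GloVe text files have no header. The sketch below shows one way such a header could be skipped. The function `parse_embedding_lines` is a hypothetical helper, it returns plain lists rather than NumPy arrays, and the two-integer test for a header is a heuristic assumption.

```python
import io

def parse_embedding_lines(lines):
    """Parse 'word v1 v2 ...' lines, skipping an optional 'count dim' header."""
    vectors = {}
    for line_number, line in enumerate(lines):
        fields = line.rstrip().split(' ')
        # Heuristic: a first line made of exactly two integers is a header
        if line_number == 0 and len(fields) == 2 and all(f.isdigit() for f in fields):
            continue
        word, values = fields[0], fields[1:]
        vectors[word] = [float(v) for v in values]
    return vectors

# Tiny made-up files: one without a header, one with a '2 2' header
headerless_file = io.StringIO("the 0.1 0.2\ncat 0.3 0.4\n")
header_file = io.StringIO("2 2\nthe 0.1 0.2\ncat 0.3 0.4\n")

parsed_headerless = parse_embedding_lines(headerless_file)
parsed_with_header = parse_embedding_lines(header_file)
print(parsed_headerless == parsed_with_header)
```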

In [48]:
import numpy as np
embeddings_file = 'glove.6B.50d.txt'

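# Load pre-trained embeddings file (one "word v1 v2 ..." entry per line)
def load_embedding(word, *arr):
    return word, np.asarray(arr, dtype='float32')
# Open with a context manager so the file handle is closed after loading
with open(embeddings_file, encoding='utf-8') as f:
    embedding_dictionary = dict(load_embedding(*o.strip().split(" "))
                                for o in f)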
print(embedding_dictionary['the'])

[ 4.1800e-01  2.4968e-01 -4.1242e-01  1.2170e-01  3.4527e-01 -4.4457e-02
 -4.9688e-01 -1.7862e-01 -6.6023e-04 -6.5660e-01  2.7843e-01 -1.4767e-01
 -5.5677e-01  1.4658e-01 -9.5095e-03  1.1658e-02  1.0204e-01 -1.2792e-01
 -8.4430e-01 -1.2181e-01 -1.6801e-02 -3.3279e-01 -1.5520e-01 -2.3131e-01
 -1.9181e-01 -1.8823e+00 -7.6746e-01  9.9051e-02 -4.2125e-01 -1.9526e-01
  4.0071e+00 -1.8594e-01 -5.2287e-01 -3.1681e-01  5.9213e-04  7.4449e-03
  1.7778e-01 -1.5897e-01  1.2041e-02 -5.4223e-02 -2.9871e-01 -1.5749e-01
 -3.4758e-01 -4.5637e-02 -4.4251e-01  1.8785e-01  2.7849e-03 -1.8411e-01
 -1.1514e-01 -7.8581e-01]


In [47]:
# Create embeddings matrix
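word_index = tokenizer.word_index
# Tokenizer indices start at 1, so reserve one extra row for the unused index 0;
# without the +1, a vocabulary smaller than vocab_size would overflow the matrix
nb_words = min(vocab_size, len(word_index) + 1)
embedding_size = len(embedding_dictionary['cat'])
embedding_matrix = np.zeros((nb_words, embedding_size))
for word, i in word_index.items():
    if i < nb_words:
        embedding_vector = embedding_dictionary.get(word)
        # Words without a pre-trained vector keep an all-zero row
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector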
print(embedding_matrix[word_index['the']])

[ 4.18000013e-01  2.49679998e-01 -4.12420005e-01  1.21699996e-01
  3.45270008e-01 -4.44569997e-02 -4.96879995e-01 -1.78619996e-01
 -6.60229998e-04 -6.56599998e-01  2.78430015e-01 -1.47670001e-01
 -5.56770027e-01  1.46579996e-01 -9.50950012e-03  1.16579998e-02
  1.02040000e-01 -1.27920002e-01 -8.44299972e-01 -1.21809997e-01
 -1.68009996e-02 -3.32789987e-01 -1.55200005e-01 -2.31309995e-01
 -1.91809997e-01 -1.88230002e+00 -7.67459989e-01  9.90509987e-02
 -4.21249986e-01 -1.95260003e-01  4.00710011e+00 -1.85939997e-01
 -5.22870004e-01 -3.16810012e-01  5.92130003e-04  7.44489999e-03
  1.77780002e-01 -1.58969998e-01  1.20409997e-02 -5.42230010e-02
 -2.98709989e-01 -1.57490000e-01 -3.47579986e-01 -4.56370004e-02
 -4.42510009e-01  1.87849998e-01  2.78489990e-03 -1.84110001e-01
 -1.15139998e-01 -7.85809994e-01]
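
Rows of the matrix stay all-zero for words that have no pre-trained vector, so it can be useful to check how much of the vocabulary is actually covered. The sketch below builds a tiny matrix from made-up data with plain Python lists and counts the covered words. All `toy_*` names and values are invented for illustration, and the matrix is sized with one extra row because tokenizer indices start at 1.

```python
# Made-up vocabulary (1-based indices) and made-up 2-d embeddings
toy_word_index = {'the': 1, 'cat': 2, 'zzyzx': 3, 'rare': 4}
toy_embeddings = {'the': [0.1, 0.2], 'cat': [0.3, 0.4], 'rare': [0.5, 0.6]}
toy_vocab_size = 4

# One row per kept index, plus row 0, which no word uses
num_rows = min(toy_vocab_size, len(toy_word_index) + 1)
toy_matrix = [[0.0, 0.0] for _ in range(num_rows)]
covered = 0
for word, i in toy_word_index.items():
    if i < num_rows and word in toy_embeddings:
        toy_matrix[i] = list(toy_embeddings[word])
        covered += 1

# Fraction of kept words (indices 1 .. num_rows - 1) with a pre-trained vector
coverage = covered / (num_rows - 1)
print(toy_matrix)
print(covered, "of", num_rows - 1, "kept words have a pre-trained vector")
```

Here 'zzyzx' is kept but has no embedding, so its row stays zero, while 'rare' has an embedding but falls outside the vocabulary cutoff.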


In [49]:
# Save embedding matrix to file
embeddings_matrix_filename = 'embeddings_matrix.pkl'
with open(embeddings_matrix_filename, 'wb') as f:
    pickle.dump(embedding_matrix, f)