# Preprocessing

This notebook tokenizes the `wikitext` datasets (in this case, WikiText-2).

Tokenization uses multiprocessing and returns a list of ints and, optionally, a dictionary (`word2int`) that maps each word to its int.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path
import numpy as np
import pickle

from lm.preprocess import WikiTextTokenizer

## 1. Training

In [3]:
train_path = Path('wikitext-2-raw/wiki.train.raw')

Note that the path input is a list; if it contains multiple paths, their contents are concatenated.

In [4]:
train_tokenizer = WikiTextTokenizer([train_path])

Loaded 629 articles


`preprocess` optionally accepts an existing dictionary as input. If one is passed, it is used to create the integer list and no new dictionary is returned.

This is how `preprocess` is used for the validation and test sets.
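The two modes described above can be illustrated with a minimal sketch. This is not the `WikiTextTokenizer` implementation: the `encode` function, the `UNK` placeholder, and the choice to map unseen words to a single id are all assumptions made for illustration, and the real `preprocess` may handle out-of-vocabulary words differently.

```python
from typing import Dict, List, Optional, Tuple, Union

UNK = "<unk>"  # hypothetical placeholder for words missing from the dictionary


def encode(
    tokens: List[str], word2int: Optional[Dict[str, int]] = None
) -> Union[Tuple[List[int], Dict[str, int]], List[int]]:
    """Turn tokens into ints, building a dictionary only if none is given."""
    if word2int is None:
        # Training mode: assign ids in order of first appearance.
        vocab: Dict[str, int] = {UNK: 0}
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
        return [vocab[tok] for tok in tokens], vocab
    # Val/test mode: reuse the given dictionary (assumed to contain UNK);
    # words it does not contain fall back to the UNK id.
    unk_id = word2int[UNK]
    return [word2int.get(tok, unk_id) for tok in tokens]


# Without a dictionary, both the ints and the new dictionary come back.
train_ints, vocab = encode("the cat sat on the mat".split())
# With a dictionary, only the ints come back.
val_ints = encode("the dog sat".split(), word2int=vocab)
```

Reusing the training dictionary keeps the ids consistent across splits, so an int means the same word in the training, validation, and test arrays.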

In [6]:
tokenized_train_ints, word2int = train_tokenizer.preprocess()

Processed 100 articles
Processed 200 articles
Processed 300 articles
Processed 400 articles
Processed 500 articles
Processed 600 articles
Tokenized articles!


In [14]:
len(tokenized_train_ints)

2105498

## 2. Test / Val

In [17]:
val_path = Path('wikitext-2-raw/wiki.valid.raw')
test_path = Path('wikitext-2-raw/wiki.test.raw')

In [18]:
valtest_tokenizer = WikiTextTokenizer([val_path, test_path])

Loaded 60 articles
Loaded 64 articles


In [19]:
tokenized_valtest_ints = valtest_tokenizer.preprocess(word2int=word2int)

Processed 100 articles
Tokenized articles!


## 3. Save everything

In [10]:
np.save('wikitext_train_int_tokens.npy', np.asarray(tokenized_train_ints))

In [20]:
np.save('wikitext_valtest_int_tokens.npy', np.asarray(tokenized_valtest_ints))

In [12]:
dict_path = Path('word2int.pickle')
with dict_path.open(mode='wb') as file:
    pickle.dump(word2int, file, protocol=pickle.HIGHEST_PROTOCOL)
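
For completeness, here is a sketch of how the saved files might be read back later. The `load_tokens` helper and the `int2word` inverse map are hypothetical names introduced here, and the sketch assumes `word2int` is a plain one-to-one `dict` from strings to ints.

```python
import pickle
from pathlib import Path

import numpy as np


def load_tokens(array_path, dict_path):
    """Load a saved token array and dictionary; also build the inverse map."""
    tokens = np.load(array_path)
    with Path(dict_path).open(mode="rb") as file:
        word2int = pickle.load(file)
    # Invert the dictionary so ints can be turned back into words.
    int2word = {index: word for word, index in word2int.items()}
    return tokens, word2int, int2word


# Usage, with the file names from the cells above:
# tokens, word2int, int2word = load_tokens(
#     'wikitext_train_int_tokens.npy', 'word2int.pickle')
# print(' '.join(int2word[int(i)] for i in tokens[:20]))
```

Decoding a few tokens this way is a quick sanity check that the array and the dictionary belong together.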