# Preprocessing

This notebook tokenizes the `IMDB` dataset.

Tokenization runs with multiprocessing and returns a list of ints and, optionally, the dictionary relating those ints to the words they stand for.
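To make the output format concrete, here is a minimal sketch of turning text into a list of ints with a word-to-int dictionary. The whitespace tokenizer, the `<unk>` token, and the `build_vocab` and `encode` helpers are assumptions made for illustration; they are not the `IMDBTokenizer` implementation.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary words


def build_vocab(texts, min_count=1):
    """Map each sufficiently frequent word to an integer id (0 is reserved for UNK)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {UNK: 0}
    for word, count in counts.most_common():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(text, word2int):
    """Turn one text into a list of ints, falling back to the UNK id."""
    return [word2int.get(word, word2int[UNK]) for word in text.lower().split()]


toy_texts = ["a great movie", "a dull movie"]
toy_word2int = build_vocab(toy_texts)
toy_int2word = {i: w for w, i in toy_word2int.items()}  # inverse mapping, ints back to strings

encoded = encode("a great film", toy_word2int)  # "film" is not in the toy vocabulary
decoded = [toy_int2word[i] for i in encoded]
```

Inverting the dictionary, as `toy_int2word` does, is what lets the ints be read back as strings.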

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from pathlib import Path
import numpy as np
import pickle

from lm.preprocess import IMDBTokenizer

Load the wikitext `word2int` dictionary, since we want to use models trained on the wikitext dataset.

In [3]:
dict_path = Path('word2int.pickle')
with dict_path.open(mode='rb') as file:
    word2int = pickle.load(file)

## 1. Training

In [4]:
train_path = Path('aclImdb/train')

Note that the path input is a list; if it contains multiple paths, the articles loaded from them are concatenated.
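As a rough illustration of what concatenating several paths could look like, the sketch below gathers the files under each directory in a list into one flat list. The `collect_files` helper, the `*.txt` pattern, and the toy folder names are assumptions; the notebook does not show how `IMDBTokenizer` reads its input.

```python
from pathlib import Path
import tempfile


def collect_files(paths, pattern="*.txt"):
    """Return the files found under every path in `paths` as one flat list."""
    files = []
    for path in paths:
        # Sort within each path so the order is reproducible across runs.
        files.extend(sorted(Path(path).rglob(pattern)))
    return files


# Toy directory layout standing in for two dataset folders.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for folder, names in [("train", ["0_9.txt", "1_2.txt"]), ("extra", ["2_7.txt"])]:
        (root / folder).mkdir()
        for name in names:
            (root / folder / name).write_text("some review text")
    all_files = collect_files([root / "train", root / "extra"])
    file_names = [f.name for f in all_files]
```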

In [5]:
train_tokenizer = IMDBTokenizer([train_path], chunks=2)

Loaded 25000 articles


`preprocess` accepts an optional dictionary as input. When one is passed, it is used to create the integer list, and no dictionary is returned.

This is how `preprocess` is used for the test and val sets. The training set is processed the same way here, so that its ints match the wikitext dictionary.

In [6]:
tokenized_train_ints = train_tokenizer.preprocess(word2int=word2int)

Processed 10000 comments
Processed 20000 comments
Tokenized articles!
Coverage is 92.99285909719396 %
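The notebook does not define the coverage figure. A plausible reading is the percentage of tokens that have an entry in the supplied dictionary. The sketch below computes that quantity on toy data; the `coverage` function is hypothetical and may differ from what `preprocess` actually reports.

```python
def coverage(tokenized_texts, word2int):
    """Percentage of tokens (counted with repeats) that have an entry in word2int."""
    total = sum(len(tokens) for tokens in tokenized_texts)
    known = sum(token in word2int for tokens in tokenized_texts for token in tokens)
    return 100.0 * known / total if total else 0.0


toy_vocab = {"a": 1, "great": 2, "movie": 3}
toy_tokens = [["a", "great", "movie"], ["a", "dull", "movie"]]

# Five of the six tokens are in the toy vocabulary; "dull" is not.
toy_coverage = coverage(toy_tokens, toy_vocab)
```

Under this reading, a coverage of about 93% would mean that roughly 7% of the IMDB tokens have no entry in the wikitext dictionary.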


In [7]:
train_labels = train_tokenizer.get_labels()

In [8]:
assert len(train_labels) == len(tokenized_train_ints)
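For context, the `aclImdb` archive stores labelled reviews in `pos` and `neg` subfolders, so one label per review can be read off the file path. The sketch below shows that idea; the `label_from_path` helper and the 0/1 encoding are assumptions, and `get_labels` may work differently.

```python
from pathlib import Path

LABELS = {"neg": 0, "pos": 1}  # assumed encoding, not confirmed by the notebook


def label_from_path(path):
    """Read the sentiment label from the name of the review's parent folder."""
    return LABELS[Path(path).parent.name]


toy_paths = [Path("aclImdb/train/pos/0_9.txt"), Path("aclImdb/train/neg/0_3.txt")]
toy_labels = [label_from_path(p) for p in toy_paths]
```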

## 2. Test / Val

In [9]:
test_path = Path('aclImdb/test')

In [10]:
test_tokenizer = IMDBTokenizer([test_path])

Loaded 25000 articles


In [11]:
tokenized_test_ints = test_tokenizer.preprocess(word2int=word2int)

Processed 10000 comments
Processed 20000 comments
Tokenized articles!
Coverage is 93.01979655541021 %


In [12]:
test_labels = test_tokenizer.get_labels()

## 3. Save everything

In [13]:
np.save('imdb_train_int_tokens.npy', np.asarray(tokenized_train_ints))
np.save('imdb_train_labels.npy', train_labels)

In [14]:
np.save('imdb_test_int_tokens.npy', np.asarray(tokenized_test_ints))
np.save('imdb_test_labels.npy', test_labels)
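Reviews differ in length, so the token lists are probably ragged. NumPy 1.24 and later raise an error when building an array from ragged nested lists unless `dtype=object` is requested, and object arrays need `allow_pickle=True` when loaded. The sketch below shows one way to save and reload such data, assuming `preprocess` returns one list of ints per review; the toy data and file name are made up.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for tokenized reviews of different lengths.
toy_tokens = [[4, 8, 15], [16, 23], [42]]

# Fill a pre-allocated object array so each review stays a separate list,
# even if all reviews happen to have the same length.
ragged = np.empty(len(toy_tokens), dtype=object)
for i, tokens in enumerate(toy_tokens):
    ragged[i] = tokens

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_int_tokens.npy")
    np.save(path, ragged)  # object arrays are pickled inside the .npy file
    restored = np.load(path, allow_pickle=True)  # required for object arrays

restored_lists = list(restored)
```

Since `allow_pickle=True` can execute arbitrary code from an untrusted file, it should only be used on files you created yourself.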