# Corpus pre-tokenization

Reads a corpus, fits a tokenizer on it, and writes the tokenized result as a new pre-processed corpus.

## Config

Path to the corpus file to process

In [None]:
inputcorpus = "corpus/chupitos.txt"

Reader method used to load this corpus

In [None]:
from neurowriter.corpus import Corpus

reader = Corpus.load_multilinetxt

Tokenizer class to use (`WordTokenizer` is imported as an alternative to `SubwordTokenizer`)

In [None]:
from neurowriter.tokenizer import WordTokenizer, SubwordTokenizer

tokenizer = SubwordTokenizer()
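To make the fit/transform contract used in the rest of this notebook concrete, here is a minimal sketch of a tokenizer exposing the same two methods. `ToyWordTokenizer`, its `min_count` and `unknown` parameters, and the sample sentences are hypothetical and only for illustration; this is not how the neurowriter tokenizers are implemented.

```python
import re
from collections import Counter


class ToyWordTokenizer:
    """Hypothetical stand-in illustrating a fit/transform tokenizer interface."""

    def __init__(self, min_count=1, unknown="<unk>"):
        self.min_count = min_count  # assumption: rare tokens are dropped from the vocabulary
        self.unknown = unknown      # assumption: placeholder for out-of-vocabulary tokens
        self.vocabulary = set()

    @staticmethod
    def _split(doc):
        # Words and single punctuation marks become separate tokens
        return re.findall(r"\w+|[^\w\s]", doc)

    def fit(self, corpus):
        # Count every token across all documents and keep the frequent ones
        counts = Counter(tok for doc in corpus for tok in self._split(doc))
        self.vocabulary = {tok for tok, n in counts.items() if n >= self.min_count}
        return self

    def transform(self, doc):
        # Replace tokens not seen often enough during fit with the placeholder
        return [tok if tok in self.vocabulary else self.unknown
                for tok in self._split(doc)]


toy = ToyWordTokenizer(min_count=2).fit(["un chupito mas", "otro chupito, por favor"])
toy_tokens = toy.transform("otro chupito!")
print(toy_tokens)
```

Only "chupito" appears twice in the toy corpus, so every other token is mapped to the placeholder.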

Path of the output file where the processed corpus will be written

In [None]:
outputcorpus = "corpus/chupitos_bpe.json"

## Run

Read corpus

In [None]:
corpus = reader(inputcorpus)
corpus[0:3]

Fit tokenizer on corpus

In [None]:
%%time
tokenizer.fit(corpus)
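The `_bpe` suffix in the output filename suggests that the subword tokenizer is based on byte-pair encoding, though the notebook does not confirm this. Under that assumption, the sketch below shows what fitting could amount to in a toy setting: repeatedly merging the most frequent pair of adjacent symbols. The function `learn_bpe_merges` and the sample words are hypothetical, and this is not the neurowriter implementation.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Each distinct word starts as a tuple of single characters, with its frequency
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties go to the pair counted first
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_vocab[tuple(out)] += freq
        vocab = merged_vocab
    return merges


toy_merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(toy_merges)
```

Each learned merge becomes a new subword unit, so frequent character sequences end up as single tokens while rare words remain split into smaller pieces.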

Transform corpus

In [None]:
transformed = Corpus([tokenizer.transform(doc) for doc in corpus])
transformed[0:3]

Save the processed corpus

In [None]:
transformed.save_json(outputcorpus)
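As a rough illustration of why JSON suits a tokenized corpus, the sketch below round-trips a list of token lists with the standard `json` module. The on-disk layout written by `save_json` is not shown in this notebook, so the list-of-lists structure, the `toy_docs` data, and the temporary path are assumptions.

```python
import json
import os
import tempfile

# Assumption: a tokenized corpus can be represented as a list of token lists
toy_docs = [["un", "chupito"], ["otro", "chupito", "!"]]

# Write to a temporary location so no real corpus file is touched
toy_path = os.path.join(tempfile.mkdtemp(), "toy_corpus.json")
with open(toy_path, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps accented characters readable in the file
    json.dump(toy_docs, f, ensure_ascii=False)

# Reading the file back recovers the same nested structure
with open(toy_path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored == toy_docs)
```

Because token boundaries are stored explicitly, a later step can load the corpus without re-running the tokenizer.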

## Some corpus statistics

In [None]:
import seaborn as sns
%matplotlib inline

Distribution of document lengths (in tokens)

In [None]:
lengths = [len(doc) for doc in transformed]
print("Max length", max(lengths))
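# distplot is deprecated since seaborn 0.11; histplot with kde=True is the closest replacement
sns.histplot(lengths, kde=True)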
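Alongside the plot, a few summary numbers can make the length distribution easier to compare across tokenizers. The sketch below is one possible way to compute them with the standard library; `summarize_lengths`, the nearest-rank 95th percentile, and the `toy_lengths` values are illustrative choices, not part of neurowriter.

```python
import math
import statistics


def summarize_lengths(lengths):
    """Return basic summary statistics for a list of document lengths."""
    ordered = sorted(lengths)
    # Nearest-rank 95th percentile: smallest value covering 95% of documents
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


# Made-up lengths for illustration; in the notebook, pass `lengths` instead
toy_lengths = [3, 5, 8, 13, 40]
toy_summary = summarize_lengths(toy_lengths)
print(toy_summary)
```

A large gap between the median and the maximum indicates a few very long documents, which the maximum alone does not reveal.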