# Preparing a poetry corpus for training

By [Allison Parrish](http://www.decontextualize.com/)

I wanted to train a VAE with [BPEmb](https://nlp.h-its.org/bpemb/)'s pretrained sub-word embeddings. This notebook helps create the dataset. First, install `bpemb`.

In [None]:
!pip install bpemb

Load `BPEmb` with the desired vocabulary size. The `BPEmb` object downloads the models and embeddings on demand, so this might take a while the first time you run it.

In [1]:
import bpemb
import json, gzip, random

In [None]:
bp = bpemb.BPEmb(lang='en', dim=200, vs=25000)

Download the [Project Gutenberg Poetry Corpus](https://github.com/aparrish/gutenberg-poetry-corpus) and change the path below to its location on your drive. The following loads in all ~3M lines of poetry:

In [17]:
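lines = []
# Each row of the file is one JSON object; the line of poetry is stored under the key 's'.
with gzip.open("/Users/allison/projects/gutenberg-dammit-archive/gutenberg-poetry-v001.ndjson.gz", "rt", encoding="utf-8") as fh:
    for line in fh:
        data = json.loads(line.strip())
        lines.append(data['s'])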

In [18]:
len(lines)

3085117
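
If you don't have the corpus downloaded yet, you can try the loading step on a small stand-in file first. The sketch below assumes only what the loading cell above relies on: one JSON object per row, with the line of poetry under the key `s`. The helper name `load_lines` and the sample records are made up for illustration.

```python
import gzip
import json
import os
import tempfile


def load_lines(path):
    """Read a gzipped newline-delimited JSON file and return the 's' field of each record."""
    result = []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for raw in fh:
            raw = raw.strip()
            if not raw:  # tolerate blank rows
                continue
            result.append(json.loads(raw)["s"])
    return result


# Build a tiny stand-in file (made-up records, not real corpus data).
sample_records = [{"s": "first made-up line"}, {"s": "second made-up line"}]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.ndjson.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as fh:
    for record in sample_records:
        fh.write(json.dumps(record) + "\n")

sample_lines = load_lines(sample_path)
print(len(sample_lines))
```

Pointing `load_lines` at the real corpus path should give the same list as the loading cell, though I haven't timed it against the full ~3M lines.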

Shuffle:

In [19]:
random.shuffle(lines)
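
Note that `random.shuffle` produces a different order on every run, so the splits will differ each time you rebuild the dataset. If you want to be able to regenerate the same splits, one option (not something this notebook does) is to shuffle with a seeded `random.Random` instance. The seed value, the helper name `shuffled_copy`, and the demo list below are arbitrary.

```python
import random

SEED = 1234  # arbitrary; any fixed integer works


def shuffled_copy(items, seed=SEED):
    """Return a shuffled copy of items; the same seed always gives the same order."""
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    out = list(items)
    rng.shuffle(out)
    return out


demo = ["line %d" % i for i in range(10)]
first = shuffled_copy(demo)
second = shuffled_copy(demo)
print(first == second)  # True: both calls used the same seed
```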

Set the name of the dataset and the size of the training, validation and test sets:

In [28]:
dataset_name = 'poetry_10k_sample'
train_size = 10000
valid_size = 1000
test_size = 1000
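
The cells that follow take the three sets as consecutive slices of the shuffled list, so they never overlap, provided the corpus has at least `train_size + valid_size + test_size` lines. Here is a minimal sketch of that slicing with an explicit size check. The helper name `split_lines` and the demo data are hypothetical.

```python
def split_lines(lines, train_size, valid_size, test_size):
    """Cut a list into three consecutive, non-overlapping slices."""
    total = train_size + valid_size + test_size
    if total > len(lines):
        raise ValueError(
            "requested %d lines but only %d are available" % (total, len(lines))
        )
    train = lines[:train_size]
    valid = lines[train_size:train_size + valid_size]
    test = lines[train_size + valid_size:total]
    return train, valid, test


demo_lines = ["line %d" % i for i in range(12)]
train, valid, test = split_lines(demo_lines, 8, 2, 2)
print(len(train), len(valid), len(test))  # 8 2 2
```

Without the check, Python slicing would silently return shorter lists if the sizes asked for more lines than the corpus has.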

Then write out the files, using `bp.encode` to split each line into sub-word tokens from the fixed-size vocabulary. Each line of poetry becomes one row of space-separated tokens.

In [29]:
!mkdir -p datasets/{dataset_name}_data

In [31]:
with gzip.open("datasets/%s_data/%s.train.txt.gz" % (dataset_name, dataset_name), "wt", encoding="utf-8") as fh:
    for line in lines[:train_size]:
        fh.write(' '.join(bp.encode(line)) + "\n")

In [32]:
with gzip.open("datasets/%s_data/%s.valid.txt.gz" % (dataset_name, dataset_name), "wt", encoding="utf-8") as fh:
    for line in lines[train_size:train_size+valid_size]:
        fh.write(' '.join(bp.encode(line)) + "\n")

In [33]:
with gzip.open("datasets/%s_data/%s.test.txt.gz" % (dataset_name, dataset_name), "wt", encoding="utf-8") as fh:
    for line in lines[train_size+valid_size:train_size+valid_size+test_size]:
        fh.write(' '.join(bp.encode(line)) + "\n")

In [34]:
with open("datasets/%s_data/vocab.txt" % dataset_name, "w", encoding="utf-8") as fh:
    for item in bp.words:
        fh.write(item + "\n")
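
Before training, it can be worth reading one of the files back to check that every row splits cleanly into tokens. The sketch below writes and re-reads a small file in the same one-row-per-line, space-separated format as the cells above. So that it runs without downloading the BPEmb models, it uses a stand-in encoder, `toy_encode`, that just lowercases and splits on whitespace; it is not what `bp.encode` does. The helper names `write_split` and `read_split` are also hypothetical.

```python
import gzip
import os
import tempfile


def write_split(path, lines, encode):
    """Write one encoded line per row, tokens separated by single spaces."""
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for line in lines:
            fh.write(" ".join(encode(line)) + "\n")


def read_split(path):
    """Read a file written by write_split back into a list of token lists."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [row.split() for row in fh]


def toy_encode(text):
    # Stand-in for bp.encode, used only so this example has no dependencies.
    return text.lower().split()


out_path = os.path.join(tempfile.mkdtemp(), "demo.train.txt.gz")
write_split(out_path, ["A made-up line", "Another one"], toy_encode)
rows = read_split(out_path)
print(rows)
```

In the notebook itself you would pass `bp.encode` as the `encode` argument instead of `toy_encode`.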

You can now train the model on this data as usual. Create a file `config/config_YOUR_DATASET_NAME.py` with your desired hyperparameters, then train with the commands described in the README.