This Colaboratory notebook trains OpenAI's GPT-2 model on the Dutch language.

Setup:

1) Enable the GPU: Edit > Notebook Settings > Hardware accelerator

Note: Colab resets after 12 hours. Save your model checkpoints to Google Drive at or before the 10-11 hour mark, then go to Runtime > Reset all runtimes. Afterwards, copy your trained model back into Colab and resume training from the previous checkpoint.
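The backup step described in the note can also be done from Python instead of `!cp`. The following is a minimal sketch, not part of the original notebook: `backup_checkpoint` is a hypothetical helper name, and it assumes Python 3.8 or newer (for `dirs_exist_ok`).

```python
import shutil
from pathlib import Path


def backup_checkpoint(src, dst):
    """Copy the directory `src` to `dst`, overwriting files that already exist.

    Hypothetical helper: the notebook itself uses `!cp -r` for this step.
    """
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        raise FileNotFoundError(f"checkpoint directory not found: {src}")
    # dirs_exist_ok (Python 3.8+) allows repeated backups into the same target.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst
```

With Drive mounted, a call such as `backup_checkpoint('/content/gpt-2/models/117MSP', '/content/drive/My Drive/checkpoint_from_scratch/117MSP')` would mirror the copy commands used later in this notebook.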

## Mount Google Drive
Mount Google Drive to save checkpoints and load them again later. You will be asked to log in to your Google account.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [0]:
!nvidia-smi

Clone repo

In [0]:
!git clone https://github.com/zhemann/gpt-2.git

In [0]:
cd gpt-2

Install requirements

In [0]:
!pip3 install -r requirements.txt

## Sentencepiece
Download, build and install [sentencepiece](https://github.com/google/sentencepiece)


In [0]:
!pip3 install sentencepiece

In [0]:
cd /content

In [0]:
%%bash -e
if ! [[ -f ./spm_train ]]; then
  wget https://github.com/google/sentencepiece/archive/v0.1.82.zip
  unzip v0.1.82.zip
fi

In [0]:
%cd sentencepiece-0.1.82
%mkdir build
%cd build

In [0]:
!cmake ..

In [0]:
!make -j $(nproc)

In [0]:
!sudo make install

In [0]:
!sudo ldconfig -v

## Set Python IO Encoding

In [0]:
%env PYTHONIOENCODING=UTF-8
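A variable only affects later commands if it is set in the notebook's own process, because each `!` command runs in a separate shell. As a minimal sketch (the helper name `child_io_encoding` is made up for illustration), the setting can be applied through `os.environ` and then checked from a child process:

```python
import os
import subprocess
import sys

# Set the variable in the notebook process so that child processes inherit it.
os.environ["PYTHONIOENCODING"] = "UTF-8"


def child_io_encoding():
    """Return the PYTHONIOENCODING value seen by a freshly started Python process."""
    result = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ.get('PYTHONIOENCODING', ''))"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```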

## Copy dataset
Copy your dataset files from your Drive directory into Colab.

In [0]:
cd /content/gpt-2

In [0]:
mkdir data

In [0]:
!cp -r /content/drive/My\ Drive/data_from_scratch/dataset_wiki.txt /content/gpt-2/data/

In [0]:
!cp -r /content/drive/My\ Drive/data_from_scratch/dataset_columns.txt /content/gpt-2/data/

In [0]:
!cp -r /content/drive/My\ Drive/data_from_scratch/dataset_books.txt /content/gpt-2/data/

## Create dictionary files
1. Combine all .txt files in the directory gpt-2/data into one large .txt file.
2. Create the dictionary files from that large .txt file.

In [0]:
cd /content/gpt-2

In [0]:
!sh scripts/concat.sh data datasets_books_columns.txt

In [0]:
!cp datasets_books_columns.txt data/

In [0]:
!sh scripts/createspmodel.sh data/datasets_books_columns.txt 40000

In [0]:
!mkdir -p models/117MSP

In [0]:
cd /content/gpt-2

In [0]:
!cp hparams_117M.json models/117MSP/hparams.json
!cp sp.model models/117MSP/
!cp sp.vocab models/117MSP/
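Before encoding or training, it can help to confirm that the model directory really contains the three files copied above. This is a small sketch under the assumption that `hparams.json`, `sp.model` and `sp.vocab` are the only files needed at this point; `missing_model_files` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

# Files copied into models/117MSP by the cell above.
REQUIRED_FILES = ("hparams.json", "sp.model", "sp.vocab")


def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the names from `required` that are not present in `model_dir`."""
    model_dir = Path(model_dir)
    return [name for name in required if not (model_dir / name).is_file()]
```

An empty list from `missing_model_files('models/117MSP')` means all three files are in place.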

## Load trained model
Load your trained model for use in sampling below. Create directory 'models' if it doesn't exist yet.

In [0]:
cd /content/gpt-2

In [0]:
mkdir models

Sometimes copying leaves the directory in a bad state. If that happens, run the following command and re-create the models directory (see the cell above).

In [0]:
rm -r models

Load model from drive into 'models' directory

In [0]:
!cp -r /content/drive/My\ Drive/checkpoint_from_scratch/117MSP /content/gpt-2/models/

## Create encoded dataset


In [0]:
!cp -r /content/drive/My\ Drive/data_from_scratch/dataset_books_columns_enc.npz /content/gpt-2/models/117MSP/

First, run `concat.sh` to create one dataset from multiple files and to add custom newline tokens `<|n|>` to your dataset. This is necessary because SentencePiece does not add such a token to the dictionary automatically.

In [0]:
!sh scripts/concat.sh data data/dataset_books_columns_concat.txt 
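To make the idea behind this step concrete, here is a rough Python sketch of concatenating files while marking line breaks with `<|n|>`. It is an assumption about the general approach, not a port of `concat.sh`: the real script may place the token differently, and `concat_with_newline_token` is a hypothetical name.

```python
from pathlib import Path

NEWLINE_TOKEN = "<|n|>"


def concat_with_newline_token(data_dir, out_path, token=NEWLINE_TOKEN):
    """Merge all .txt files in `data_dir` into `out_path`.

    Assumption: every line gets `token` appended before its line break.
    Returns the number of source files that were merged.
    """
    out_path = Path(out_path)
    # Skip the output file itself in case it lives inside data_dir.
    sources = sorted(
        p for p in Path(data_dir).glob("*.txt")
        if p.resolve() != out_path.resolve()
    )
    with out_path.open("w", encoding="utf-8") as out:
        for path in sources:
            with path.open(encoding="utf-8") as src:
                for line in src:
                    out.write(line.rstrip("\n") + token + "\n")
    return len(sources)
```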

Then, run `encode.sh` to encode your dataset

In [0]:
!sh scripts/encode.sh data/dataset_books_columns_concat.txt 117MSP dataset_books_columns_enc.npz

In [0]:
!cp -r /content/gpt-2/models/117MSP/dataset_books_columns_enc.npz /content/drive/My\ Drive/data_from_scratch/

## Train model
Start training and save model to drive.

In [0]:
cd /content/gpt-2

In [0]:
!PYTHONPATH=src ./train.py --dataset models/117MSP/dataset_columns_enc.npz --model_name '117MSP' --steps 5000 --sample_every 1000 --save_every 4000 --learning_rate 2.5e-4 --run_name run1 

In [0]:
!cp -r /content/gpt-2/models/117MSP/ /content/drive/My\ Drive/checkpoint_from_scratch/

Train model and save to drive

In [0]:
!PYTHONPATH=src ./train.py --dataset models/117MSP/dataset_columns_enc.npz --model_name '117MSP' --steps 4000 --sample_every 1000 --save_every 30000 --learning_rate 2.5e-4 --run_name run1 

In [0]:
!cp -r /content/gpt-2/models/117MSP/ /content/drive/My\ Drive/checkpoint_from_scratch/

Train model and save to drive

In [0]:
!PYTHONPATH=src ./train.py --dataset models/117MSP/dataset_columns_enc.npz --model_name '117MSP' --steps 5000 --sample_every 2000 --save_every 30000 --learning_rate 2.5e-4 --run_name run1 

In [0]:
!cp -r /content/gpt-2/models/117MSP/ /content/drive/My\ Drive/checkpoint_from_scratch/

## Generate samples
Generate conditional samples from the model for a prompt you provide. Change the top-k hyperparameter if desired (default is 40).

In [0]:
!python3 src/interactive_conditional_samples.py --top_k 40 --temperature 0.5 --length 300 --model_name '117MSP' --nsamples 5 --truncate '<|endoftext|>'

In [0]:
!python3 src/interactive_conditional_samples.py --top_k 10 --length 300 --model_name '117MSP' --truncate '<|endoftext|>' --include_prefix False
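The `--top_k` flag used in the commands above limits each sampling step to the k most likely tokens. The sketch below shows that filtering step on a plain Python list of logits. It is illustrative only and is not the repository's TensorFlow implementation; `top_k_filter` is a hypothetical name.

```python
import math


def top_k_filter(logits, k):
    """Keep the k largest logits and set all others to -inf.

    k <= 0 means "no restriction", matching the convention that 0 disables
    top-k. Ties at the threshold are all kept, so more than k entries can
    survive when values are equal.
    """
    if k <= 0 or k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]
```

After filtering, a softmax assigns zero probability to the masked tokens, so with `k=1` only the single most likely token can be chosen.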

To check flag descriptions, use:

In [0]:
!python3 src/interactive_conditional_samples.py -- --help

Generate unconditional samples from the model 

In [0]:
cd /content/gpt-2

In [0]:
!python3 src/generate_unconditional_samples.py --nsamples 25 --length 300 --temperature 0.7 --model_name "117MSP" --top_k 40 | tee /tmp/samples

To check flag descriptions, use:

In [0]:
!python3 src/generate_unconditional_samples.py -- --help

### Unconditional samples manually

In [0]:
cd /content/gpt-2

In [0]:
import fire
import json
import os
import numpy as np
import tensorflow as tf

import sys
sys.path.insert(0, 'src')  # assumption: model, sample and encoder_sp live in gpt-2/src

import model, sample, encoder_sp as encoder

def sample_model(
    model_name='117M',
    seed=None,
    nsamples=0,
    batch_size=1,
    length=None,
    temperature=1,
    top_k=0,
    top_p=0.0,
    run_name='run1'
):
    """
    Run the sample_model
    :model_name=117M : String, which model to use
    :seed=None : Integer seed for random number generators, fix seed to
     reproduce results
    :nsamples=0 : Number of samples to return, if 0, continues to
     generate samples indefinitely.
    :batch_size=1 : Number of batches (only affects speed/memory).
    :length=None : Number of tokens in generated text, if None (default), is
     determined by model hyperparameters
    :temperature=1 : Float value controlling randomness in boltzmann
     distribution. Lower temperature results in less random completions. As the
     temperature approaches zero, the model will become deterministic and
     repetitive. Higher temperature results in more random completions.
    :top_k=0 : Integer value controlling diversity. 1 means only 1 word is
     considered for each step (token), resulting in deterministic completions,
     while 40 means 40 words are considered at each step. 0 (default) is a
     special setting meaning no restrictions. 40 generally is a good value.
    :top_p=0.0 : Float value controlling diversity. Implements nucleus sampling,
     overriding top_k if set to a value > 0. A good setting is 0.9.
    """
    enc = encoder.get_encoder(model_name)
    hparams = model.default_hparams()
    with open(os.path.join('models', model_name, 'hparams.json')) as f:
        hparams.override_from_dict(json.load(f))

    if length is None:
        length = hparams.n_ctx
    elif length > hparams.n_ctx:
        raise ValueError("Can't get samples longer than window size: %s" % hparams.n_ctx)

    with tf.Session(graph=tf.Graph()) as sess:
        np.random.seed(seed)
        tf.set_random_seed(seed)

        output = sample.sample_sequence(
            hparams=hparams, length=length,
            start_token=enc.encode('<|n|>')[0],
            batch_size=batch_size,
            temperature=temperature, top_k=top_k
        )[:, 1:]

        saver = tf.train.Saver()
        ckpt = tf.train.latest_checkpoint(os.path.join('models', model_name, 'checkpoint/%s' % run_name))
        saver.restore(sess, ckpt)

        generated = 0
        while nsamples == 0 or generated < nsamples:
            out = sess.run(output)
            for i in range(batch_size):
                # Count one sample per decoded sequence, not one whole batch.
                generated += 1
                text = enc.decode(out[i])
                print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
                print(text)

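if __name__ == '__main__':
    # Call the function directly: wrapping the call in fire.Fire() would run
    # it first and then hand its return value (None) to Fire.
    sample_model(
        model_name='117MSP',
        seed=None,
        nsamples=0,  # 0 = keep generating until the cell is interrupted
        batch_size=1,
        length=None,
        temperature=1,
        top_k=0,
        run_name='run1')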
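The docstring above describes `temperature` and `top_p` only in words. As a rough illustration, the following pure-Python sketch shows how temperature scaling and nucleus (top-p) filtering act on a small distribution. It is not the code in `sample.py`; both function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches `top_p`, zero out the rest and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]
```

A temperature below 1 concentrates probability on the most likely token, while a temperature above 1 flattens the distribution. With `top_p=0.9`, the low-probability tail that makes up the last 10% of the mass is never sampled.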