# Train a GPT-2 Text-Generating Model on the Dutch language

For more about `gpt-2-simple`, you can visit [this GitHub repository](https://github.com/minimaxir/gpt-2-simple).


To get started:

1. Copy this notebook to your Google Drive to keep it and save your changes. (File -> Save a Copy in Drive)
2. Make sure you're running the notebook in Google Chrome.
3. Run the cells below:


In [1]:
!pip install -q gpt_2_simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files





## Verify GPU

Colaboratory now uses an Nvidia T4 GPU, which is slightly faster than the old Nvidia K80 GPU for training GPT-2 and has more memory, allowing you to train the larger GPT-2 models and generate more text. However, a K80 is still assigned sometimes, as in the output below.

You can verify which GPU is active by running the cell below.

In [2]:
!nvidia-smi

Tue Jun 18 11:49:41 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   29C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru
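
If you would rather branch on the GPU type in code than read the table, the check can be sketched in Python as below. This is a minimal sketch, not part of `gpt-2-simple`: the helper name `active_gpu_name` is made up for illustration, and it assumes `nvidia-smi` is on the `PATH` (it returns `None` otherwise).

```python
import shutil
import subprocess


def active_gpu_name():
    """Return the first GPU name reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA tools installed on this machine
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if result.returncode != 0:
        return None  # the tool exists but no usable GPU was found
    names = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    return names[0] if names else None


gpu = active_gpu_name()
print("Active GPU:", gpu if gpu else "none detected")
```

On a K80 runtime you may want to stay with the smaller model or shorter generation lengths, since the K80 has less memory than the T4.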

## Downloading GPT-2

If you're retraining a model on new text, you need to download the GPT-2 model first. 

There are two sizes of GPT-2:
    

*   `117M` (default): the "small" model, 500MB on disk.
*   `345M`: the "medium" model, 1.5GB on disk.

The next cells download the selected model from Google Cloud Storage and save it in the Colaboratory VM at `/models/<model_name>` (here `/models/345M`, because `model` is set to `"345M"`).

This model isn't permanently saved in the Colaboratory VM; you'll have to redownload it if you want to retrain it at a later time.

In [0]:
model = "345M"

In [4]:
gpt2.download_gpt2(model_name=model)

Fetching checkpoint: 1.05Mit [00:00, 475Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 74.9Mit/s]                                                   
Fetching hparams.json: 1.05Mit [00:00, 290Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 1.42Git [00:12, 113Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 215Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 49.4Mit/s]                                                
Fetching vocab.bpe: 1.05Mit [00:00, 92.8Mit/s]                                                      
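
Because the model is lost whenever the VM is recycled, it can help to check that the download is complete before finetuning. The sketch below takes its file list from the download log above; the helper name `missing_model_files` and the `models_dir` argument are assumptions made for illustration, not part of `gpt-2-simple`.

```python
import os

# File names copied from the download log above.
EXPECTED_FILES = [
    "checkpoint",
    "encoder.json",
    "hparams.json",
    "model.ckpt.data-00000-of-00001",
    "model.ckpt.index",
    "model.ckpt.meta",
    "vocab.bpe",
]


def missing_model_files(model_name, models_dir="models"):
    """Return the expected files that are absent from models_dir/model_name."""
    model_path = os.path.join(models_dir, model_name)
    return [
        name for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(model_path, name))
    ]


# A non-empty result means the model should be downloaded (again).
print(missing_model_files("345M"))
```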


## Mounting Google Drive

The best way to get your training text into the Colaboratory VM, and to get the trained model *out* of Colaboratory, is to route it through Google Drive *first*.

Running this cell (which only works in Colaboratory) mounts your personal Google Drive in the VM so that later cells can move data in and out. It will ask for an auth code; that code is not saved anywhere.

In [5]:
gpt2.mount_gdrive()

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive
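
The later `cp` commands silently depend on the mount having succeeded. Below is a minimal sketch of a guard you could run first. The mount point and the `My Drive` folder name come from the output and commands in this notebook; the helper names `is_drive_mounted` and `drive_path` are hypothetical.

```python
import os


def is_drive_mounted(mount_point="/content/drive"):
    """True if the Drive mount exposes a 'My Drive' folder."""
    return os.path.isdir(os.path.join(mount_point, "My Drive"))


def drive_path(*parts, mount_point="/content/drive"):
    """Build a path inside 'My Drive' without hand-escaping the space."""
    return os.path.join(mount_point, "My Drive", *parts)


print("Drive mounted:", is_drive_mounted())
print(drive_path("data", "encoded_columns"))
```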


## Select model to train


In [0]:
columns_only_15k = False

In [0]:
# Set run_name and the checkpoint path for one of the columns-only runs
if columns_only_15k:
  run_name = '345M_columns_only_15k'   
else:
  run_name = '345M_columns_only_30k'   
run_name_checkpoint = 'checkpoint/' + run_name

In [0]:
# Set run_name and the checkpoint path for the run on all datasets
run_name = '345M_all_datasets'   
run_name_checkpoint = 'checkpoint/' + run_name

## Copy encoded dataset

Run the two cells below to copy all encoded datasets (columns, books, wiki pages).

In [0]:
!cp -r /content/drive/My\ Drive/data/encoded_books_columns_wiki/ /content/

In [0]:
dir_name = "encoded_books_columns_wiki"

Run the two cells below to copy only the encoded columns and books datasets.

In [0]:
!cp -r /content/drive/My\ Drive/data/encoded_books_columns/ /content/

In [0]:
dir_name = "encoded_books_columns"

Run the two cells below to copy only the encoded columns dataset.

In [0]:
!cp -r /content/drive/My\ Drive/data/encoded_columns/ /content/

In [0]:
dir_name = "encoded_columns"
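
Picking a pair of cells by hand is easy to get wrong, so the three options above can also be sketched as one lookup. The folder names and the Drive path come from the cells above; the keys `"all"`, `"books_columns"` and `"columns"` and the helper `copy_encoded_dataset` are assumptions made for illustration.

```python
import os
import shutil

# Folder names as used in the cells above; the keys are made up for this sketch.
DATASET_DIRS = {
    "all": "encoded_books_columns_wiki",
    "books_columns": "encoded_books_columns",
    "columns": "encoded_columns",
}


def copy_encoded_dataset(choice,
                         drive_data_dir="/content/drive/My Drive/data",
                         dest_root="/content"):
    """Copy the chosen encoded dataset folder from Drive and return its dir_name."""
    dir_name = DATASET_DIRS[choice]
    src = os.path.join(drive_data_dir, dir_name)
    dst = os.path.join(dest_root, dir_name)
    if not os.path.isdir(dst):  # skip the copy if the folder is already there
        shutil.copytree(src, dst)
    return dir_name


# In Colaboratory, after mounting Drive:
# dir_name = copy_encoded_dataset("columns")
```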

## Uploading a Text File to be Trained to Colaboratory

Upload **any smaller text file**  (<10 MB) and update the file name in the cell below, then run the cell.

If your text file is larger than 10MB, it is recommended to upload that file to Google Drive first, then copy that file from Google Drive to the Colaboratory VM.

In [0]:
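# Create the data folder. A `cd` on the same shell line would not change the notebook's working directory.
!mkdir -p data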

In [0]:
file_name = "columns.txt"  # placeholder: replace with the name of your own text file
gpt2.copy_file_from_gdrive(file_name)

If the dataset is larger than 100 MB, it is advised to pre-encode it: the encoded, compressed `.npz` file loads much faster for finetuning on the GPU.

In [0]:
gpt2.encode_dataset(file_name, model_name="345M", out_path="columns_encoded.npz")

In [0]:
!cp columns_encoded.npz /content/drive/My\ Drive/data/
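
The size thresholds mentioned in this section (10 MB for direct upload, 100 MB for pre-encoding) can be summarized as a small decision helper. This is only a sketch of the advice above; the function name `ingest_plan` and the returned labels are made up for illustration.

```python
import os

MB = 1024 * 1024


def ingest_plan(size_bytes):
    """Map a text file size to the route recommended in this section."""
    if size_bytes < 10 * MB:
        return "upload directly"
    if size_bytes <= 100 * MB:
        return "copy via Google Drive"
    return "copy via Google Drive, then pre-encode"


def ingest_plan_for_file(path):
    """Same as ingest_plan, but reads the size from disk."""
    return ingest_plan(os.path.getsize(path))


for size_mb in (5, 50, 500):
    print(size_mb, "MB:", ingest_plan(size_mb * MB))
```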

## Load a Trained Model Checkpoint

Running the next cell will copy the `checkpoint` folder from your Google Drive into the Colaboratory VM.

In [0]:
gpt2.copy_checkpoint_from_gdrive(run_name=run_name, copy_folder=True)

The next cell will allow you to load the retrained model checkpoint + metadata necessary to generate text.

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

In [0]:
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name)
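
Before loading, you may want to confirm that the copy from Drive actually produced a checkpoint. The sketch below assumes checkpoints are stored as `model-<step>` files with a matching `.index` file (the log further down shows `checkpoint/345M_columns_only_30k/model-30000`); the helper name `latest_checkpoint_step` is hypothetical.

```python
import os
import re


def latest_checkpoint_step(run_name, checkpoint_dir="checkpoint"):
    """Return the highest step with a model-<step>.index file, or None."""
    run_path = os.path.join(checkpoint_dir, run_name)
    if not os.path.isdir(run_path):
        return None
    steps = []
    for name in os.listdir(run_path):
        match = re.fullmatch(r"model-(\d+)\.index", name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


# None means there is nothing to load for this run name yet.
print(latest_checkpoint_step("345M_all_datasets"))
```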

## Finetune GPT-2

The next cell will start the actual finetuning of GPT-2. It creates a persistent TensorFlow session which stores the training config, then runs the training for the specified number of `steps`. To have the finetuning run indefinitely, set `steps = -1`.

By default, model checkpoints are saved in `/checkpoint/run1` every 500 steps and when the cell is stopped. The cell below changes both: it saves to the `run_name` subfolder every 5,000 steps (`save_every=5000`).

The training might time out after 4ish hours; make sure you end training and save the results so you don't lose them!

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

Other optional-but-helpful parameters for `gpt2.finetune`:


*  **`restore_from`**: Set to `fresh` to start training from the base GPT-2, or set to `latest` to restart training from an existing checkpoint.
* **`sample_every`**: Number of steps between printed example outputs.
* **`print_every`**: Number of steps to print training progress.
* **`learning_rate`**:  Learning rate for the training. (default `1e-4`, can lower to `1e-5` if you have <1MB input data)
*  **`run_name`**: subfolder within `checkpoint` to save the model. This is useful if you want to work with multiple models (will also need to specify  `run_name` when loading the model)
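
Since a typo in these parameters only shows up after the session has started (and a rerun needs a VM restart), it can pay off to collect and sanity-check them first. The sketch below uses only the keyword names that appear in this notebook. The helper `finetune_kwargs`, its checks, and its defaults are assumptions made for illustration and are not the library's own validation or defaults.

```python
def finetune_kwargs(dataset, model_name, run_name, steps=1000,
                    restore_from="fresh", sample_every=100, print_every=50,
                    save_every=500, learning_rate=1e-4):
    """Validate the finetuning settings and return them as a dict."""
    if restore_from not in ("fresh", "latest"):
        raise ValueError("restore_from must be 'fresh' or 'latest'")
    if steps != -1 and steps < 1:
        raise ValueError("steps must be positive, or -1 to run indefinitely")
    for name, value in (("sample_every", sample_every),
                        ("print_every", print_every),
                        ("save_every", save_every)):
        if value < 1:
            raise ValueError(name + " must be at least 1")
    if learning_rate <= 0:
        raise ValueError("learning_rate must be positive")
    return dict(dataset=dataset, model_name=model_name, run_name=run_name,
                steps=steps, restore_from=restore_from,
                sample_every=sample_every, print_every=print_every,
                save_every=save_every, learning_rate=learning_rate)


kwargs = finetune_kwargs("encoded_columns", "345M", "345M_all_datasets",
                         steps=15000, restore_from="latest", save_every=5000)
print(kwargs)
```

The resulting dict could then be passed on as `gpt2.finetune(sess, **kwargs)`.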

In [0]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=dir_name,
              model_name=model,
              steps=15000,
              restore_from='latest',
              print_every=50,
              sample_every=100,
              save_every=5000,
              learning_rate=0.0001,
              run_name=run_name
              )


Loading checkpoint checkpoint/345M_columns_only_30k/model-30000


100%|██████████| 1/1 [00:00<00:00, 22.20it/s]

Loading dataset...
dataset has 789166 tokens
Training...
Saving checkpoint/345M_columns_only_30k/model-30000





ig een uitroepteken waar de drie vlot van de Tweede Wereldoorlog staat. Zodat volgens de verhalen die je vermoedt, pakt je schrift (ik vat de titel), je ging je nieuws en je freak daar een eet. Ooit heb ik gelezen dat het kan behoorlijk best aan Nederland.’’
<|endoftext|>
Zeventig jaar geleden was de oorlog afgelopen. Uitgeverij Van der Meulen stond op het idee van de film ‘Catch 22’, in een poging het antwoord er als motifs voor dag en lacht. Vooral de vijf voor het zidgeel rolde het in Nederlandse film ‘Zero point’, in een Ria Valkhuis.
Nou, als ik het logisch was, zou het iets niet lekker zijn.
Morgen worden ze enkel een miljoen Nederlanders die er het uit willen. De vraag is natuurlijk: is het voor het programma, ben je een lijstklap? Dat wil de Laura Nederlanders van Kimono kunnen belegen, vorige week was er een 000-rood boerderij om iets te ontwerpen.
Nou, dan zou ik wel het haar las. Iking.
Er zijn nog twee Nederlanders die erppen, en meestal deel ze nog, om te melden dat het pe

After the model is trained, you can copy the checkpoint folder to your own Google Drive.

If you want to download it to your personal computer, it's strongly recommended that you copy it to Google Drive first, then download it from Google Drive.

In [0]:
gpt2.copy_checkpoint_to_gdrive(run_name=run_name, copy_folder=True)

You're done! Feel free to go to the **Generate Text From The Trained Model** section to generate text based on your retrained model.

## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can now generate text. `generate` generates a single text from the loaded model.

In [0]:
gpt2.generate(sess, return_as_list=True, run_name=run_name)[0]

If you're creating an API based on your model and need to pass the generated text elsewhere, you can do `text = gpt2.generate(sess, return_as_list=True)[0]`

You can also pass in a `prefix` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `nsamples`. Unique to GPT-2, you can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 20 for `batch_size`).
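
A small helper can pick a batch size for a given `nsamples`. This sketch assumes that `nsamples` has to be a multiple of `batch_size`, and it uses the cap of 20 mentioned above; the function name `largest_batch_size` is made up for illustration.

```python
def largest_batch_size(nsamples, cap=20):
    """Return the largest divisor of nsamples that does not exceed cap."""
    if nsamples < 1:
        raise ValueError("nsamples must be at least 1")
    for size in range(min(cap, nsamples), 0, -1):
        if nsamples % size == 0:
            return size


for n in (5, 50, 23):
    print(n, "samples -> batch_size", largest_batch_size(n))
```

For the bulk-generation cells later in this notebook (`nsamples=50`), this gives the same `batch_size=10` that is used there.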

Other optional-but-helpful parameters for `gpt2.generate` and friends:

*  **`length`**: Number of tokens to generate (default 1023, the maximum)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`truncate`**: Truncates the generated text at a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the input texts are short.
*  **`include_prefix`**: If using `truncate` and `include_prefix=False`, the specified `prefix` will not be included in the returned text.
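
To make the interplay of `truncate`, `prefix` and `include_prefix` concrete, here is a plain-string sketch of the behaviour described above. It is an illustration only, not the library's implementation, and the function name `apply_truncate` is hypothetical.

```python
def apply_truncate(text, truncate=None, prefix=None, include_prefix=True):
    """Cut text at the first truncate marker; optionally drop a leading prefix."""
    if not truncate:
        return text  # nothing to do without a truncate sequence
    if prefix and not include_prefix and text.startswith(prefix):
        text = text[len(prefix):]
    end = text.find(truncate)
    return text[:end] if end != -1 else text


sample = "Een rare tijd, net voor kerst.<|endoftext|>Volgende column"
print(apply_truncate(sample, truncate="<|endoftext|>"))
print(apply_truncate(sample, truncate="<|endoftext|>",
                     prefix="Een rare tijd, ", include_prefix=False))
```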

In [0]:
gpt2.generate(sess,
              length=300,
              temperature=0.7,
              prefix="Als er een eretitel Drukste Toeristenstadje Ter Wereld bestond, was Mérida, de stad in de Andes waar ik een paar jaar geleden was, een kanshebber.",
              include_prefix=True,
              truncate="<|endoftext|>",
              nsamples=5,
              batch_size=1,
              run_name=run_name,
              top_k = 40
              )

Als er een eretitel Drukste Toeristenstadje Ter Wereld bestond, was Mérida, de stad in de Andes waar ik een paar jaar geleden was, een kanshebber. Een retitel die rhytterde als een boom. ,,Geen mensen die er rhyt tegenaan’’, zei ik. ,,De vader zegt vooral: ‘Wij wensen uit ons Beursgenot’.’’
Zelf was ik niet rhyt, het waren wel voorouders die ik derde klasgenoot in de lagere school heb gehaald. Rapp stond erom bekend dat je uit dat verhaal niet eens een echte boom mag heten.
Maar de bame was niet misschien helemaal rhyt, het is geen mooie gedachte. En als die mop nog heet, zegt u dat ook allemaal die twee oude ooms, die elkaar in de gang hadden gezeten. Die waren akademien in 1974 ook verderop gaan.
De vader van de boom was daar al mee op de klas geweest, hij was de eerste die er in rij mensen die ik ernaar vroeg. ,,Dat heb ik niet gedaan’’, zei hij
Als er een eretitel Drukste Toeristenstadje Ter Wereld bestond, was Mérida, de stad in de Andes waar ik een paar jaar geleden was, een kans
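
The cells in this section set `temperature=0.7` and `top_k=40`. As a rough intuition for what these two knobs do to the next-token distribution, here is a self-contained sketch in plain Python. It shows the general sampling technique on toy numbers and is not taken from the `gpt-2-simple` source; the function name `next_token_probs` is made up.

```python
import math


def next_token_probs(logits, temperature=0.7, top_k=0):
    """Softmax over logits after temperature scaling and optional top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k > 0:
        # Keep only the top_k highest logits; the rest get probability zero.
        kth = sorted(logits, reverse=True)[min(top_k, len(logits)) - 1]
        logits = [x if x >= kth else float("-inf") for x in logits]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]
for t in (0.7, 1.0, 1.5):
    print(t, [round(p, 3) for p in next_token_probs(toy_logits, temperature=t)])
```

Lower temperatures concentrate probability on the most likely token, higher temperatures flatten the distribution, which is why high values read as "crazier" text.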

In [0]:
gpt2.generate(sess,
              length=300,
              temperature=0.7,
              prefix="We remden bij een dorpje, omdat een veldje met bomen vol opgewonden mannen stond.",
              include_prefix=True,
              truncate="<|endoftext|>",
              nsamples=5,
              batch_size=1,
              run_name=run_name,
              top_k = 40
              )

We remden bij een dorpje, omdat een veldje met bomen vol opgewonden mannen stond. We remden bij nacht op straat, omdat er een opgewonden vrouw in de zaal zat. En omdat er op het dak bijna geen vrouw in de zaal was. We remden bij nacht op straat omdat er op gebeurtenissen gebeurtenissen zijn met namen als Khalil, Mohamad, Adelanto, Ady Sidonia, Afrika en Nabobje.
,,U zat toch bij de carnavalsspelen?”, vroeg ik mijn overbuurvrouw M. in de tent. ,,Wordt u al geholpen?” ,,Ja, u ging als een heilige naar de nieuwe bioscoop, waar hij boeken uit het hart van Godleken kon zien. Dat klonk als lytse Hille.”
M. en zijn vrouw waren erbij, die kwamen uit Heerenveen. Ze hadden een caravan in Bant, bij de uitlaat van de Prinsentuin, waar naar verwijzingen IJlst en Dalen even bij kon.
Zelf had
We remden bij een dorpje, omdat een veldje met bomen vol opgewonden mannen stond. We remden bij Dútsjes, omdat we aan het banaan hadden te klagen. We remden bij de provincie, omdat we daar spenden en drukken. We

In [0]:
gpt2.generate(sess,
              length=300,
              temperature=0.7,
              prefix="Woensdagavond mocht ik met de watertaxi naar de vaste wal. Die is er al bijna 25 jaar maar ik was er nog nooit mee meegevaren.",
              include_prefix=True,
              truncate="<|endoftext|>",
              nsamples=5,
              batch_size=1,
              run_name=run_name,
              top_k = 40
              )

In [0]:
gpt2.generate(sess,
              length=300,
              temperature=0.7,
              prefix="Een rare tijd, net voor kerst, vooral als je helemaal niet in kerststemming bent.",
              include_prefix=True,
              truncate="<|endoftext|>",
              nsamples=5,
              batch_size=1,
              run_name=run_name,
              top_k = 40
              )

For bulk generation, you can generate a large amount of text to a file and sort out the samples locally on your computer. The next cell will write the generated texts to a file with a unique timestamp in its name and then download it.

You can rerun the cell as many times as you want for even more generated texts!

Run one of the cells below, depending on the sampling temperature you want (0.7, 0.9 or 1.1).

In [0]:
gen_file = 'gpt2_simple_{:%Y%m%d_%H%M%S}_iter=10k_t=0.7.txt'.format(datetime.utcnow())

gpt2.generate_to_file(sess,
                      destination_path=gen_file,
                      length=320,
                      temperature=0.7,
                      truncate="<|endoftext|>",
                      nsamples=50,
                      batch_size=10,
                      run_name=run_name
                      )

files.download(gen_file)

In [0]:
gen_file = 'gpt2_simple_{:%Y%m%d_%H%M%S}_iter=10k_t=0.9.txt'.format(datetime.utcnow())

gpt2.generate_to_file(sess,
                      destination_path=gen_file,
                      length=320,
                      temperature=0.9,
                      truncate="<|endoftext|>",
                      nsamples=50,
                      batch_size=10,
                      run_name=run_name
                      )

files.download(gen_file)

In [0]:
gen_file = 'gpt2_simple_{:%Y%m%d_%H%M%S}_iter=10k_t=1.1.txt'.format(datetime.utcnow())

gpt2.generate_to_file(sess,
                      destination_path=gen_file,
                      length=320,
                      temperature=1.1,
                      truncate="<|endoftext|>",
                      nsamples=50,
                      batch_size=10,
                      run_name=run_name
                      )

files.download(gen_file)
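
The three cells above differ only in the temperature, so the file naming can be sketched as one helper and a loop. The name pattern is copied from the cells above; the function `gen_file_name` and its `iterations` and `now` arguments are assumptions made for illustration.

```python
from datetime import datetime, timezone


def gen_file_name(temperature, iterations="10k", now=None):
    """Build a timestamped output file name in the pattern used above."""
    now = now or datetime.now(timezone.utc)
    return "gpt2_simple_{:%Y%m%d_%H%M%S}_iter={}_t={}.txt".format(
        now, iterations, temperature)


for t in (0.7, 0.9, 1.1):
    print(gen_file_name(t))
```

Inside the loop you could then call `gpt2.generate_to_file(sess, destination_path=gen_file_name(t), temperature=t, ...)` with the remaining arguments from the cells above, followed by `files.download(...)`.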

# Etcetera

If the notebook has errors (e.g. GPU Sync Fail or out-of-memory/OOM), force-kill the Colaboratory virtual machine and restart it with the command below:

In [0]:
!kill -9 -1

# LICENSE

MIT License

Copyright (c) 2019 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.