# 1. Model 1 - GPT-2 345M: Pre-trained model on newspaper columns
To start:
* Execute all cells belonging to step 1

Then, to generate sample texts:
* Execute cells belonging to step 2

Or, to fine-tune the model:
* Execute cells belonging to step 3


For more about `gpt-2-simple`, you can visit [this GitHub repository](https://github.com/minimaxir/gpt-2-simple).





## Step 1.1 Install gpt_2_simple

In [0]:
!pip install -q gpt_2_simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

## Step 1.2 Verify GPU

Colaboratory now uses an Nvidia T4 GPU, which is slightly faster than the old Nvidia K80 GPU for training GPT-2 and has more memory, allowing you to train the larger GPT-2 models and generate more text. However, the K80 is sometimes still assigned.

You can verify which GPU is active by running the cell below.

In [0]:
!nvidia-smi
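
If you would rather read the GPU name from Python (for example to pick settings automatically), a minimal sketch could look like the one below. It assumes `nvidia-smi` is on the PATH and relies on its `--query-gpu` option; `active_gpu` is a made-up helper name, not part of `gpt-2-simple`.

```python
import shutil
import subprocess


def active_gpu():
    """Return 'name, total memory' for the first GPU, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA driver tools in this runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else None


print(active_gpu())
```

A result of `None` means no GPU was found, in which case it is worth checking the runtime type before continuing.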

## Step 1.3 Init variables


In [0]:
run_name = 'GPT-2_345M_columns'      
model = "345M"

## Step 1.4 Downloading GPT-2
The next cell downloads the 345M version of GPT-2 from Google Cloud Storage and saves it in the Colaboratory VM under `models/345M` (relative to the working directory).


In [0]:
model = "345M"
gpt2.download_gpt2(model_name=model)
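
The download is fairly large, so after a runtime restart it can be worth checking whether the model folder is already present. The sketch below is one way to do that; `model_is_downloaded` is a hypothetical helper, and it assumes the model lives in `models/<model_name>` relative to the working directory.

```python
import os


def model_is_downloaded(model_name, models_dir="models"):
    """True if models_dir/model_name exists and contains at least one file."""
    path = os.path.join(models_dir, model_name)
    return os.path.isdir(path) and len(os.listdir(path)) > 0


# In the notebook this could guard the download call, for example:
#     if not model_is_downloaded(model):
#         gpt2.download_gpt2(model_name=model)
print(model_is_downloaded("345M"))
```

Note that this only checks that the folder is non-empty, so an interrupted download would still count as present.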

## Step 1.5 Mounting Google Drive

The best way to get input text to-be-trained into the Colaboratory VM, and to get the trained model *out* of Colaboratory, is to route it through Google Drive *first*.

Running this cell (which will only work in Colaboratory) will mount your personal Google Drive in the VM, which later cells can use to get data in and out. (It will ask for an auth code; that code is not saved anywhere.)


In [0]:
gpt2.mount_gdrive()

## Step 1.6 Load a Trained Model Checkpoint

Running the next cell will copy the `checkpoint` folder from your Google Drive into the Colaboratory VM.

In [0]:
gpt2.copy_checkpoint_from_gdrive(run_name=run_name, copy_folder=True)

# 2. Generating sample texts
To generate samples, please stick to this section.

See section **3. Fine-tuning the model** if you want to fine-tune the model.

## 2.1 Initialize session

The next cell will allow you to load the retrained model checkpoint + metadata necessary to generate text.

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

In [0]:
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, run_name)

## 2.2 Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can now generate text. `generate` generates a single text from the loaded model.

You can also pass in a `prefix` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `nsamples`. Unique to GPT-2, you can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 20 for `batch_size`).
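
As far as we can tell, `gpt-2-simple` requires `nsamples` to be a multiple of `batch_size`. A small sketch for picking the largest usable batch size under the limit of 20 mentioned above is shown here; `largest_batch_size` is a made-up helper, not a library function.

```python
def largest_batch_size(nsamples, max_batch_size=20):
    """Largest divisor of nsamples that does not exceed max_batch_size."""
    if nsamples < 1:
        raise ValueError("nsamples must be at least 1")
    for candidate in range(min(nsamples, max_batch_size), 0, -1):
        if nsamples % candidate == 0:
            return candidate


print(largest_batch_size(50))  # 10
print(largest_batch_size(5))   # 5
print(largest_batch_size(37))  # 1
```

The result could then be passed as `batch_size` alongside the same `nsamples`.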

Other optional-but-helpful parameters for `gpt2.generate` and friends:

* **`length`**: Number of tokens to generate (default 1023, the maximum).
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0).
* **`truncate`**: Truncates the generated text at a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the training texts are short.
* **`include_prefix`**: If using `truncate` and `include_prefix=False`, the specified `prefix` will not be included in the returned text.
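
To make the interaction between `truncate`, `prefix` and `include_prefix` concrete, here is a plain-Python illustration of the behaviour described in the list above. It is only a sketch of the documented effect, not the library's actual implementation, and `truncate_sample` is a hypothetical name.

```python
def truncate_sample(text, truncate=None, prefix=None, include_prefix=True):
    """Mimic the documented effect of `truncate` and `include_prefix`."""
    if not truncate:
        return text  # without `truncate`, the text is returned unchanged
    if prefix and not include_prefix and text.startswith(prefix):
        text = text[len(prefix):]  # drop the prefix from the returned text
    end = text.find(truncate)
    return text if end == -1 else text[:end]


raw = "Title: A column<|endoftext|>Title: Another"
print(truncate_sample(raw, truncate="<|endoftext|>"))
# Title: A column
print(truncate_sample(raw, truncate="<|endoftext|>",
                      prefix="Title: ", include_prefix=False))
# A column
```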

In [0]:
gpt2.generate(sess,
              length=300,
              temperature=0.7,
              prefix="Als er een eretitel Drukste Toeristenstadje Ter Wereld bestond, was Mérida, de stad in de Andes waar ik een paar jaar geleden was, een kanshebber.",
              include_prefix=True,
              truncate="<|endoftext|>",
              nsamples=5,
              batch_size=1,
              run_name=run_name,
              top_k=40
              )

For bulk generation, you can generate a large amount of text to a file and sort out the samples locally on your computer. The next cell will create a text file with a unique timestamp in its name and then download it.

You can rerun the cell as many times as you want for even more generated texts!

Run the cell below that matches your data, depending on whether your texts use prefixes and suffixes.

In [0]:
gen_file = 'gpt2_simple_{:%Y%m%d_%H%M%S}_iter=30k_t=0.7.txt'.format(datetime.utcnow())

gpt2.generate_to_file(sess,
                      destination_path=gen_file,
                      length=320,
                      temperature=0.7,
                      truncate="<|endoftext|>",
                      nsamples=50,
                      batch_size=10,
                      run_name=run_name
                      )

files.download(gen_file)
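
Once the file is on your computer, the samples can be separated again with a few lines of Python. The sketch below assumes the samples are separated by a line of 20 `=` characters, which we believe is the default `sample_delim` of `generate_to_file`; adjust it if your file looks different. `split_samples` is a hypothetical helper, and the demo file exists only to make the example runnable.

```python
import os
import tempfile


def split_samples(path, sample_delim="=" * 20 + "\n"):
    """Read a generated text file and return the non-empty samples."""
    with open(path, encoding="utf-8") as handle:
        content = handle.read()
    return [part.strip() for part in content.split(sample_delim) if part.strip()]


# Small demo file standing in for a downloaded generation file.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("first sample\n" + "=" * 20 + "\n" + "second sample\n" + "=" * 20 + "\n")

print(split_samples(demo_path))  # ['first sample', 'second sample']
```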

# 3. Fine-tuning the model
This section can be used to fine-tune the current model further. An encoded dataset can be loaded from Google Drive, but raw datasets can also be used to fine-tune the model directly.

**IMPORTANT NOTE:** You cannot immediately continue with fine-tuning after you have generated samples in Step 2. You have to **restart the VM first** (Runtime -> Restart Runtime). You will not need to re-copy files, but you do have to rerun imports.

## 3.1 Copying a Dataset
Run the three cells below to create a directory called `encoded_data_pre-trained` and copy the encoded columns dataset from Google Drive into it, ready for fine-tuning.

In [0]:
%cd /content/

In [0]:
!mkdir -p encoded_data_pre-trained

In [0]:
!cp -r /content/drive/My\ Drive/encoded_data_pre-trained/columns_enc_pre-trained.npz /content/encoded_data_pre-trained/
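
Copies from Google Drive occasionally end up incomplete, so a quick sanity check of the copied file can save a failed training run. Since an `.npz` file is a zip archive of `.npy` arrays, the standard library is enough for a rough check. This is only a sketch: `list_npz_arrays` is a made-up helper, and it reads the whole archive, which may take a moment for large datasets.

```python
import zipfile


def list_npz_arrays(path):
    """Return the array names stored in an .npz file, or None if it is unreadable."""
    if not zipfile.is_zipfile(path):
        return None  # missing file or not a zip archive
    with zipfile.ZipFile(path) as archive:
        if archive.testzip() is not None:  # a member failed its CRC check
            return None
        return [name[:-len(".npy")] for name in archive.namelist()
                if name.endswith(".npy")]


path = "/content/encoded_data_pre-trained/columns_enc_pre-trained.npz"
print(list_npz_arrays(path))
```

A result of `None` suggests the copy should be repeated before starting the fine-tuning cell.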

## 3.2 Fine-tune GPT-2

The next cell will start the actual finetuning of GPT-2. It creates a persistent TensorFlow session which stores the training config, then runs the training for the specified number of `steps`. (to have the finetuning run indefinitely, set `steps = -1`)

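With the settings in the cell below, the model checkpoints are saved in `checkpoint/GPT-2_345M_columns` (the folder is named after `run_name`; the library default is `run1`). They are written every 5000 steps (set by `save_every`) and when the cell is stopped.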

The training might time out after 4ish hours; make sure you end training and save the results so you don't lose them!

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

Other optional-but-helpful parameters for `gpt2.finetune`:


* **`restore_from`**: Set to `fresh` to start training from the base GPT-2, or set to `latest` to resume training from an existing checkpoint.
* **`sample_every`**: Number of steps between printed example outputs.
* **`print_every`**: Number of steps between printed training progress updates.
* **`learning_rate`**: Learning rate for the training (default `1e-4`, can lower to `1e-5` if you have <1MB input data).
* **`run_name`**: Subfolder within `checkpoint` in which to save the model. This is useful if you want to work with multiple models (you will also need to specify `run_name` when loading the model).
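
To get a feel for how much output a run produces, the arithmetic can be sketched as below, using the values from the training cell in this section. It assumes each periodic event fires roughly once per interval; the exact count may differ by one depending on where the step counter starts when resuming from a checkpoint. `training_schedule` is a made-up helper, not part of `gpt-2-simple`.

```python
def training_schedule(steps, print_every, sample_every, save_every):
    """Count roughly how often each periodic event fires during `steps` steps."""
    return {
        "progress_prints": steps // print_every,
        "samples": steps // sample_every,
        "periodic_saves": steps // save_every,
    }


print(training_schedule(steps=15000, print_every=50, sample_every=100, save_every=5000))
# {'progress_prints': 300, 'samples': 150, 'periodic_saves': 3}
```

With only three periodic saves in 15000 steps, a session timeout can cost up to 5000 steps of progress, so a smaller `save_every` may be worth considering.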

In [0]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset="encoded_data_pre-trained",
              model_name=model,
              steps=15000,
              restore_from='latest',
              print_every=50,
              sample_every=100,
              save_every=5000,
              learning_rate=0.0001,
              run_name=run_name
              )

After the model is trained, you can copy the checkpoint folder to your own Google Drive.

If you want to download it to your personal computer, it's strongly recommended that you copy it to Google Drive first and then download it from there.

**IMPORTANT NOTE:** You must rename the existing `{run_name}` folder in Google Drive before running `copy_checkpoint_to_gdrive`; otherwise the copy fails. Unfortunately, `gpt-2-simple` does not update an existing directory but raises an error instead.
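
The rename can also be done from the notebook instead of the Drive web interface. The sketch below appends a timestamp to an existing folder so nothing is overwritten. It assumes the earlier copy lives at `/content/drive/My Drive/checkpoint/<run_name>`, which is where we believe `copy_folder=True` places it; check your Drive before relying on this path. `archive_existing_folder` is a hypothetical helper.

```python
import os
from datetime import datetime


def archive_existing_folder(path):
    """Rename `path` with a timestamp suffix if it exists; return the new path or None."""
    if not os.path.isdir(path):
        return None  # nothing to rename
    archived = "{}_{:%Y%m%d_%H%M%S}".format(path.rstrip("/"), datetime.now())
    os.rename(path, archived)
    return archived


# Assumed Drive location of the previously copied checkpoint folder.
drive_checkpoint = "/content/drive/My Drive/checkpoint/GPT-2_345M_columns"
print(archive_existing_folder(drive_checkpoint))
```

If the function prints `None`, no earlier copy was found and the copy cell can be run directly.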

In [0]:
gpt2.copy_checkpoint_to_gdrive(run_name=run_name, copy_folder=True)

# Etcetera

If the notebook has errors (e.g. GPU Sync Fail or out-of-memory/OOM), force-kill the Colaboratory virtual machine and restart it with the command below:

In [0]:
!kill -9 -1