# Pre-training of GPT-2

In this notebook we pre-train GPT-2 on the EmpatheticDialogues dataset. The resulting model serves as the starting point for bringing in human feedback and as a baseline for comparison with our final model.

In [None]:
%tensorflow_version 1.13.1

Connect your own Google Drive to this notebook to save the trained models:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
%cd /content/drive/My Drive/nlp-2021-vda/pre-training-gpt2
!pwd

Download the initial GPT-2 model and the datasets from Google Cloud Storage:

In [None]:
import os
if not os.path.isdir('./datasets'):
    !gsutil -m cp -r gs://nlp-lab/pre-training-gpt2/datasets ./
if not os.path.isdir('./gpt-2'):
    !gsutil -m cp -r gs://nlp-lab/pre-training-gpt2/gpt-2 ./
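
Before starting a long run, it can help to confirm that the downloads produced the files the training call expects. The following is a minimal sketch of such a check. The helper name `missing_paths` is hypothetical and not part of the `training` module, and the model subdirectory `gpt-2/models/345M` is an assumption derived from the `model_dir` and `model_name` values used in the training call of this notebook.

```python
import os

# Paths referenced by the training call in this notebook.
# The model subdirectory is an assumption: model_dir joined with model_name.
EXPECTED_PATHS = [
    "datasets/empatheticdialogues/train.csv",
    "datasets/empatheticdialogues/valid.csv",
    os.path.join("gpt-2/models", "345M"),
]


def missing_paths(paths, root="."):
    """Return the entries of `paths` that do not exist under `root`."""
    return [p for p in paths if not os.path.exists(os.path.join(root, p))]


missing = missing_paths(EXPECTED_PATHS)
if missing:
    print("Missing before training:", missing)
else:
    print("All expected files are present.")
```

If anything is reported as missing, re-running the download cell above is the first thing to try.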

To continue fine-tuning from our fine-tuned GPT-2 model, download its checkpoint by uncommenting the line below:

In [None]:
#!gsutil -m cp -r gs://nlp-lab/pre-training-gpt2/checkpoint ./

In [None]:
from training import pre_train_gpt2

In [None]:
sess = pre_train_gpt2.start_tf_sess()

In [None]:
# Parameter descriptions

# sess - TensorFlow session
# train_dataset - path to the training data
# val_dataset - path to the validation data
# steps - number of training steps (-1 trains indefinitely)
# model_name - name of the initial GPT-2 model, e.g. 124M, 355M (originally released as 345M), or 774M
# model_dir - path where the initial GPT-2 model is stored
# batch_size - batch size
# learning_rate - learning rate
# accumulate_gradients - accumulate gradients across N minibatches
# context_maxlen - maximum length of the context in tokens
# response_maxlen - maximum length of the response in tokens
# history_len - number of previous dialogue turns to include in the context
# patience - number of steps to wait for the validation loss to decrease
# restore_from - either "latest", "fresh", or a path to a checkpoint file
# run_name - run ID; name of the subdirectory in checkpoint/
# checkpoint_dir - path where the checkpoints are stored or loaded from
# multi_gpu - set to True if you have multiple GPUs
# print_every - print stats every N steps
# max_checkpoints - maximum number of checkpoints to keep (assumed from the name)
# optimizer - which optimizer to use
# overwrite - set to True to overwrite previous checkpoints

pre_train_gpt2.finetune(sess=sess,
             train_dataset='datasets/empatheticdialogues/train.csv',
             val_dataset='datasets/empatheticdialogues/valid.csv',
             steps=-1,
             model_name='345M',
             model_dir='gpt-2/models',
             batch_size=8,
             learning_rate=0.000007,
             accumulate_gradients=5,
             context_maxlen=100,
             response_maxlen=100,
             history_len=4,
             patience=50,
             restore_from='latest',
             run_name='355M',
             checkpoint_dir='checkpoint',
             multi_gpu=False,
             print_every=1,
             max_checkpoints=1,
             optimizer='adam',
             overwrite=False)
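
With `steps=-1` the run has no fixed step limit, so `patience` is presumably what ends training. The sketch below illustrates one common form of patience-based early stopping. It is an assumption about the general technique, not the actual logic inside `pre_train_gpt2.finetune`; the class name `PatienceTracker` is hypothetical and the loss values are made up for illustration.

```python
class PatienceTracker:
    """Signal a stop after `patience` consecutive evaluations without improvement."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_evals = 0  # evaluations since the last improvement

    def update(self, val_loss):
        """Record a validation loss; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Hypothetical validation losses, not results from this notebook.
tracker = PatienceTracker(patience=3)
losses = [2.0, 1.8, 1.9, 1.85, 1.81, 1.7]
stopped_at = None
for step, loss in enumerate(losses):
    if tracker.update(loss):
        stopped_at = step
        break
print(stopped_at, tracker.best_loss)
```

In this toy run the tracker stops at index 4, after three evaluations that fail to beat the best loss of 1.8, and never sees the later value of 1.7. A larger `patience`, such as the 50 used above, tolerates longer plateaus at the cost of more training time.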