# MemorAI - GPT-2 Simple GPU on Alex

In [None]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple

# Verify the GPU type.
# NOTE: Colaboratory assigns either an NVIDIA T4 GPU or an NVIDIA K80 GPU. The T4 is
#       slightly faster than the older K80 for training GPT-2 and has more memory,
#       which lets you train the larger GPT-2 models and generate more text.
!nvidia-smi

TensorFlow 1.x selected.
  Building wheel for gpt-2-simple (setup.py) ... done
Sun Sep 26 03:27:51 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.63.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                             

In [None]:
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files

# T5 Ref: https://colab.research.google.com/drive/1Sw_1gUPudXw7qBGey-ctSKvY87iFUtGP?authuser=1#scrollTo=g_i7eoklU6KI
# GPT2 Simple Ref: https://minimaxir.com/2019/09/howto-gpt2/
# GPT2 Simple Git: https://github.com/minimaxir/gpt-2-simple
REPO_PATH = '/drive/MyDrive/MyDrive/Berkeley/W210/w210_Capstone_Project/Repo/memorai/'
TRAINING_TXT_FILE = 'alex_tedtalk.txt'

# mount Google Drive
gpt2.mount_gdrive()
# NOTE: the file is copied from the top level of Google Drive; passing a nested folder path did not work here
gpt2.copy_file_from_gdrive(TRAINING_TXT_FILE)

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

Mounted at /content/drive


## Fine-tuning

In [None]:
# Download GPT-2 from GCS and save it to the Colaboratory VM at models/<model_name>.
# * `124M` (default): the "small" model, 500MB on disk.
# * `355M`: the "medium" model, 1.5GB on disk.
# * `774M`: the "large" model; cannot currently be finetuned with Colaboratory but can be used to generate text from the pretrained model (see later in Notebook).
# * `1558M`: the "extra large", true model. Will not work if a K80/P4 GPU is attached to the notebook (like `774M`, it cannot be finetuned).
gpt2.download_gpt2(model_name="124M")

Fetching checkpoint: 1.05Mit [00:00, 597Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 6.52Mit/s]
Fetching hparams.json: 1.05Mit [00:00, 662Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:07, 68.0Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 219Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 8.46Mit/s]
Fetching vocab.bpe: 1.05Mit [00:00, 8.50Mit/s]


In [None]:
# NOTE: If you want to rerun this cell, restart the VM first (Runtime -> Restart Runtime). 
#       You will need to rerun imports but not recopy files.
# NOTE: gpt2.finetune defaults to a batch size of 1, and each training "example" defaults to 1024 tokens,
#       so one step (i.e. one gradient update) processes 1024 tokens by default.
# Docs for gpt2.finetune:
#       'restore_from': Set to `fresh` to start training from the base GPT-2, or set to `latest` to restart training from an existing checkpoint.
#       'sample_every': Number of steps between printed example outputs.
#       'print_every': Number of steps between printed training progress lines.
#       'learning_rate': Learning rate for the training (default `1e-4`; can lower to `1e-5` if you have <1MB of input data).
#       'run_name': Subfolder within `checkpoint` in which to save the model. This is useful if you want to work with multiple models (you will also need to specify `run_name` when loading the model).
#       'overwrite': Set to `True` if you want to continue finetuning an existing model (w/ `restore_from='latest'`) without creating duplicate copies.
RUN_NAME = 'test_run'
TF_SESS = gpt2.start_tf_sess()

gpt2.finetune(
        TF_SESS,
        dataset=TRAINING_TXT_FILE,
        model_name='124M',
        steps=10,
        restore_from='fresh',
        run_name=RUN_NAME,
        print_every=2,
        sample_every=200,
        save_every=10)

# copy to gdrive
gpt2.copy_checkpoint_to_gdrive(run_name=RUN_NAME)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Loading checkpoint models/124M/model.ckpt
INFO:tensorflow:Restoring parameters from models/124M/model.ckpt
Loading dataset...


100%|██████████| 1/1 [00:00<00:00, 3618.90it/s]

dataset has 2747 tokens
Training...





[2 | 17.21] loss=2.85 avg=2.85
[4 | 26.09] loss=2.35 avg=2.60
[6 | 34.96] loss=1.73 avg=2.31
[8 | 43.84] loss=1.21 avg=2.03
[10 | 52.67] loss=1.04 avg=1.83
Saving checkpoint/test_run/model-10
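
As a rough check on the run above, the token arithmetic from the cell comments can be worked through directly. This is a back-of-the-envelope sketch, assuming the defaults noted in the comments (batch size 1, 1024 tokens per example) and the 2747-token dataset size reported in the log. The variable names are illustrative and not part of gpt-2-simple.

```python
# Back-of-the-envelope token budget for the fine-tuning run above.
# Assumptions: batch size 1 and 1024 tokens per example (the defaults noted
# in the cell comments); dataset size and step count taken from this notebook.
TOKENS_PER_EXAMPLE = 1024
BATCH_SIZE = 1
STEPS = 10
DATASET_TOKENS = 2747

tokens_per_step = TOKENS_PER_EXAMPLE * BATCH_SIZE
tokens_seen = tokens_per_step * STEPS
approx_passes = tokens_seen / DATASET_TOKENS

print(f"tokens per step: {tokens_per_step}")
print(f"tokens seen in {STEPS} steps: {tokens_seen}")
print(f"approximate passes over the dataset: {approx_passes:.2f}")
```

Under these assumptions, even 10 steps cover the dataset between three and four times, which is consistent with how quickly the loss drops in the log and suggests that longer runs on a text this small may overfit.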


## Model Loading & Inference

Running the next cell will copy the `.tar` checkpoint file from your Google Drive into the Colaboratory VM.

In [None]:
LOAD_MODEL = True
RUN_NAME = 'test_run'

if LOAD_MODEL:
    # copy the .tar checkpoint file from your Google Drive into the Colaboratory VM
    gpt2.copy_checkpoint_from_gdrive(run_name=RUN_NAME)

    # NOTE: If you want to rerun this cell, restart the VM first (Runtime -> Restart 
    #       Runtime). You will need to rerun imports but not recopy files.
    TF_SESS = gpt2.start_tf_sess()
    gpt2.load_gpt2(TF_SESS, run_name=RUN_NAME)

Loading checkpoint checkpoint/test_run/model-10
INFO:tensorflow:Restoring parameters from checkpoint/test_run/model-10


If you're creating an API based on your model and need to pass the generated text elsewhere, you can use `text = gpt2.generate(TF_SESS, return_as_list=True)[0]`.

You can also pass in a `prefix` to the generate function to force the text to start with a given character sequence and generate text from there (good if you add an indicator when the text starts).

You can also generate multiple texts at a time by specifying `nsamples`. Unique to GPT-2, you can pass a `batch_size` to generate multiple samples in parallel, giving a massive speedup (in Colaboratory, set a maximum of 20 for `batch_size`).
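
When combining `nsamples` and `batch_size`, note that gpt-2-simple, as far as I can tell, expects `nsamples` to be a multiple of `batch_size`. The helper below is a hypothetical convenience, not part of the library: it picks the largest batch size that divides `nsamples` without exceeding the Colaboratory limit of 20 mentioned above.

```python
def pick_batch_size(nsamples, max_batch_size=20):
    """Return the largest batch size <= max_batch_size that divides nsamples.

    Hypothetical helper (not part of gpt-2-simple). It assumes that nsamples
    must be an exact multiple of batch_size.
    """
    if nsamples < 1:
        raise ValueError("nsamples must be at least 1")
    for candidate in range(min(nsamples, max_batch_size), 0, -1):
        if nsamples % candidate == 0:
            return candidate


# Intended use with the session and run name defined in this notebook:
# n = 50
# texts = gpt2.generate(TF_SESS, run_name=RUN_NAME, nsamples=n,
#                       batch_size=pick_batch_size(n), return_as_list=True)
for n in (7, 40, 50):
    print(n, pick_batch_size(n))
```

For a prime `nsamples` above 20 the helper falls back to a batch size of 1, so rounding `nsamples` to a friendlier number is usually the better choice.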

Other optional-but-helpful parameters for `gpt2.generate` and friends:

* **`length`**: Number of tokens to generate (default 1023, the maximum).
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0).
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0, which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`).
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability (gets good results on a dataset with `top_p=0.9`).
* **`truncate`**: Truncates the generated text at a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the input texts are short.
* **`include_prefix`**: If using `truncate` and `include_prefix=False`, the specified `prefix` will not be included in the returned text.
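
To make `temperature`, `top_k` and `top_p` more concrete, here is a minimal pure-Python sketch of how these three settings reshape a next-token distribution. It is an illustration of the general technique, not gpt-2-simple's internal code; the function name `filter_distribution` and the toy logits are made up for this example.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=0, top_p=0.0):
    """Turn raw logits into sampling probabilities (illustrative sketch only)."""
    # Temperature: dividing by a value below 1 sharpens the distribution,
    # a value above 1 flattens it.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring tokens (0 disables the filter).
    # Tokens tied with the k-th score are also kept.
    if top_k > 0:
        threshold = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [x if x >= threshold else float("-inf") for x in scaled]

    # Softmax, shifted by the maximum for numerical stability.
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p (nucleus): keep the smallest set of most likely tokens whose
    # cumulative probability reaches top_p, then renormalize.
    if top_p > 0.0:
        order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        kept, cumulative = set(), 0.0
        for i in order:
            kept.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        probs = [p if i in kept else 0.0 for i, p in enumerate(probs)]
        total = sum(probs)
        probs = [p / total for p in probs]

    return probs


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four candidate tokens
print(filter_distribution(toy_logits, temperature=1.0))
print(filter_distribution(toy_logits, temperature=1.0, top_k=2))
print(filter_distribution(toy_logits, temperature=1.0, top_p=0.5))
```

Tokens that are filtered out end up with probability zero, so they can never be sampled; that is why lowering `top_k` or `top_p` tends to tame erratic output at the cost of variety.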

In [None]:
# text = gpt2.generate(TF_SESS, return_as_list=True)[0]
gpt2.generate(
        TF_SESS,
        length=50,
        temperature=0.5,
        prefix="Alex loves climbing because",
        nsamples=1,
        batch_size=1,
        run_name=RUN_NAME)

Alex loves climbing because it's so different. It's so different from the way you climb, because you're climbing with a rope. It's so different from the way you walk, because you're walking on a loose ground. It's so different from the way you
