<a href="https://colab.research.google.com/github/allan1738/sample-registrar-module/blob/master/copy_of_ConCollabo_Part1_GPT_2_Train_and_Inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  ConCollabo Part 1:
## Train a GPT-2 Text-Generating Model with GPU
Training on Colab is almost identical to training locally, but because a Colab session expires after a few hours, we use Google Drive syncing to save checkpoints.

*Last updated: July 7th, 2021*

## GPU

Change Colab's runtime to use GPU and verify it by running the cell below.

In [None]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
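
The `nvidia-smi` failure shown above usually means the runtime has no GPU attached. As a minimal sketch of the same check done from Python, the helper below (the name `gpu_visible` is hypothetical, not part of Colab or the GPT-2 code) reports whether `nvidia-smi` can be found and exits successfully:

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is on PATH and exits with status 0."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if not gpu_visible():
    print("No GPU detected: switch the runtime type to GPU and rerun this cell.")
```

Running this before the install and training cells avoids discovering a CPU-only runtime only after the setup has finished.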



## Cloning GPT-2

Since we are fine-tuning on a new dataset, we need to download the GPT-2 model first. 

There are four released sizes of GPT-2:

* `124M` (default): the "small" model, 500MB on disk.
* `355M`: the "medium" model, 1.5GB on disk.
* `774M`: the "large" model, which cannot currently be fine-tuned with Colab.
* `1558M`: the "extra large", full-size model. It will not work if a K80 GPU is attached to the notebook, and like `774M` it cannot be fine-tuned.

Larger models have more knowledge, but take longer to fine-tune and longer to generate text. You can specify which base model to use by changing `model_name` in the cells below.

The base model is downloaded from Google Cloud Storage and saved in the Colab VM under `models/<model_name>` inside the cloned `gpt-2` folder. The next cell clones the repository that contains the training code.

The model is not saved permanently in Colab; after the session expires, we have to download it again before retraining.

In [None]:
!git clone https://github.com/nshepperd/gpt-2.git

Cloning into 'gpt-2'...
remote: Enumerating objects: 435, done.
remote: Counting objects: 100% (64/64), done.
remote: Compressing objects: 100% (51/51), done.
remote: Total 435 (delta 19), reused 48 (delta 13), pack-reused 371
Receiving objects: 100% (435/435), 4.48 MiB | 23.76 MiB/s, done.
Resolving deltas: 100% (220/220), done.


In [None]:
cd gpt-2

/content/gpt-2


Install the prerequisite packages.

In [None]:
!python -m pip install -U tensorflow



In [None]:
!python -m pip install fire regex requests tqdm toposort numpy tensorflow -q

     |████████████████████████████████| 87 kB 4.5 MB/s 
  Building wheel for fire (setup.py) ... done


## Mount Google Drive

As mentioned earlier, Colab sessions are not permanent. We can mount Google Drive both to retrieve the input data and to store the trained model.

Run the cell below to mount your personal Google Drive.
(It will ask for an authentication code; the code is not saved anywhere and is reset when the Colab session expires or restarts.)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive



## Dataset Preparation

Place the dataset file in the cloned `gpt-2` folder.  
It can be any English text. Keep in mind that every sample should end with `<|endoftext|>`.
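
If the samples start out as separate strings rather than a ready-made file, the end-of-sample rule above could be applied with a short helper. This is only a sketch: `build_dataset` and the example strings are hypothetical, and the one-sample-then-marker layout is an assumption about how the file should look.

```python
END_OF_TEXT = "<|endoftext|>"


def build_dataset(samples):
    """Join non-empty samples, ending each one with the end-of-text marker."""
    cleaned = (sample.strip() for sample in samples)
    return "".join(f"{sample}\n{END_OF_TEXT}\n" for sample in cleaned if sample)


# Hypothetical example samples; replace them with your own text.
example_samples = ["An old silent pond", "  A frog jumps in  ", ""]
print(build_dataset(example_samples))
```

The returned string can then be written to `input.txt` in the `gpt-2` folder in place of the downloaded file.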

In [None]:
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=19Iv70cXYo0cfIPBMvVAN7yRwp9x8966f' -O input.txt


--2021-07-30 04:10:03--  https://docs.google.com/uc?export=download&id=19Iv70cXYo0cfIPBMvVAN7yRwp9x8966f
Resolving docs.google.com (docs.google.com)... 108.177.127.138, 108.177.127.139, 108.177.127.100, ...
Connecting to docs.google.com (docs.google.com)|108.177.127.138|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-04-8s-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/06g9tnt7avjghoigsv9v2aep2h5jg7fp/1627618200000/03192608778565483848/*/19Iv70cXYo0cfIPBMvVAN7yRwp9x8966f?e=download [following]
--2021-07-30 04:10:04--  https://doc-04-8s-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/06g9tnt7avjghoigsv9v2aep2h5jg7fp/1627618200000/03192608778565483848/*/19Iv70cXYo0cfIPBMvVAN7yRwp9x8966f?e=download
Resolving doc-04-8s-docs.googleusercontent.com (doc-04-8s-docs.googleusercontent.com)... 108.177.127.132, 2a00:1450:4013:c07::84
Connecting to doc-04-8s-docs.googleusercontent.com (doc-04

## Train Preparation

We need to download the base (pre-trained) GPT-2 model to fine-tune on our dataset.

In [None]:
!python download_model.py 117M

python3: can't open file 'download_model.py': [Errno 2] No such file or directory


In [None]:
%env PYTHONIOENCODING=UTF-8

If there are any pre-existing checkpoints in the Google Drive, copy them to the current session to resume training.

In [None]:
!cp -r /content/drive/MyDrive/Haikus/checkpoint/ /content/gpt-2/ 
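
The copy above fails if the Drive folder does not exist yet, for example on the very first run. As a hedged alternative, the sketch below copies checkpoints only when they are present. The function name `restore_checkpoints` is hypothetical, and `dirs_exist_ok` assumes Python 3.8 or newer.

```python
import shutil
from pathlib import Path


def restore_checkpoints(drive_checkpoint, local_checkpoint):
    """Copy a checkpoint folder from Drive if it exists; return True when copied."""
    source = Path(drive_checkpoint)
    if not source.is_dir():
        return False  # nothing saved yet, training will start from the base model
    # dirs_exist_ok lets the copy merge into an existing local folder.
    shutil.copytree(source, Path(local_checkpoint), dirs_exist_ok=True)
    return True


# Example call; adjust the Drive path to your own folder:
# restore_checkpoints("/content/drive/MyDrive/<your-folder>/checkpoint",
#                     "/content/gpt-2/checkpoint")
```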

## Train

Run the cell below to start training. The code automatically saves a checkpoint every 1,000 steps.

To stop training, interrupt the cell. Interrupting also saves a checkpoint of the current step to `gpt-2/checkpoint`.

To resume training, run the cell again. Training continues from the last saved checkpoint.

In [None]:
!PYTHONPATH=src ./train.py --dataset /content/gpt-2/input.txt --model_name '117M'

/bin/bash: ./train.py: No such file or directory


In [None]:
!transformers-cli convert --model_type gpt2 --tf_checkpoint run1 --pytorch_dump_output pytorch --config run1/hparams.json


On Colab, stop training every once in a while and back up the checkpoints to the mounted Drive. In the Drive, first delete the previous model files (unless you need them for history or comparison):

*   model-xxx.data-00000-of-00001
*   model-xxx.index
*   model-xxx.meta

In [None]:
!cp -r /content/gpt-2/checkpoint/run1/ /content/drive/MyDrive/Haikus/checkpoint

cp: cannot stat '/content/gpt-2/checkpoint/run1/': No such file or directory
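
The manual routine described above (delete the old `model-xxx` files in Drive, then copy the new ones) could also be scripted. The sketch below is one possible approach, not part of the GPT-2 repository: `backup_latest` is a hypothetical helper, and it assumes the checkpoint files follow the `model-<step>.<suffix>` naming shown in the list above.

```python
import re
import shutil
from pathlib import Path

# Matches names such as model-2000.index and captures the step number.
MODEL_FILE = re.compile(r"^model-(\d+)\.")


def backup_latest(run_dir, backup_dir):
    """Copy the newest checkpoint to backup_dir and prune older model files there."""
    run_dir, backup_dir = Path(run_dir), Path(backup_dir)
    steps = set()
    for path in run_dir.glob("model-*"):
        match = MODEL_FILE.match(path.name)
        if match:
            steps.add(int(match.group(1)))
    if not steps:
        return None  # nothing to back up yet
    newest = max(steps)
    backup_dir.mkdir(parents=True, exist_ok=True)
    # Remove model files from earlier steps that are already in the backup.
    for path in backup_dir.glob("model-*"):
        match = MODEL_FILE.match(path.name)
        if match and int(match.group(1)) != newest:
            path.unlink()
    # Copy the newest step plus any non-model files (e.g. TensorFlow's `checkpoint`).
    for path in run_dir.iterdir():
        if not path.is_file():
            continue
        match = MODEL_FILE.match(path.name)
        if match is None or int(match.group(1)) == newest:
            shutil.copy2(path, backup_dir / path.name)
    return newest


# Example call; adjust the Drive path to your own folder:
# backup_latest("/content/gpt-2/checkpoint/run1",
#               "/content/drive/MyDrive/<your-folder>/checkpoint/run1")
```

Keeping only the newest step in Drive saves space, at the cost of losing earlier checkpoints for comparison.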


## Inference

To use the fine-tuned model for inference, copy the latest checkpoint files into the folder `/content/gpt-2/models/117M/`.

In [None]:
!cp -r /content/gpt-2/checkpoint/run1/* /content/gpt-2/models/117M/

For inference, run the following cell.

In [None]:
!python src/interactive_conditional_samples.py --nsamples=1 --top_k 40 --top_p 40 --temperature 0.01 --model_name "117M"

2021-07-29 15:13:20.859243: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-29 15:13:22.957280: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-07-29 15:13:23.016377: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-29 15:13:23.017007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2021-07-29 15:13:23.017065: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-07-29 15:13:23.153377: I tensorflow/stream_executor/platform/default

Run the cell below to see all the adjustable flags/options.

In [None]:
!python3 src/interactive_conditional_samples.py -- --help

python3: can't open file 'src/interactive_conditional_samples.py': [Errno 2] No such file or directory
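
The inference command earlier in this section sets `--temperature` and `--top_k`. As a rough illustration of what such options usually control, the pure-Python sketch below samples one token index from a list of scores. It is not the repository's implementation: `sample_next` is a hypothetical function, and `--top_p` (nucleus sampling, conventionally a cumulative probability between 0 and 1) is left out.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=0, rng=random):
    """Sample an index from logits after temperature scaling and top-k filtering."""
    # Lower temperatures sharpen the distribution; temperature must be > 0.
    scaled = [value / temperature for value in logits]
    if top_k > 0:
        # Keep only the top_k highest scores; mask the rest with -inf.
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [value if value >= cutoff else float("-inf") for value in scaled]
    # Softmax weights, shifted by the maximum for numerical stability.
    highest = max(scaled)
    weights = [math.exp(value - highest) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# With top_k=1 only the highest-scoring index can be chosen.
print(sample_next([1.0, 2.0, 1.5], temperature=1.0, top_k=1))
```

A very low temperature such as `0.01` behaves almost like always picking the highest-scoring token, so the generated text will vary little between runs.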
