# 1. Model 3 - GPT-2 117M from-scratch model

1) Make sure the GPU is enabled: Edit > Notebook Settings > Hardware accelerator

To start:
* Execute all cells belonging to step 1

Then, to generate sample texts:
* Execute cells belonging to step 2

Or, to fine-tune the model:
* Execute cells belonging to step 3

**Note:** Colab resets after 12 hours. Save your model checkpoints to Google Drive at around the 10-11 hour mark or earlier, then go to Runtime > Reset all runtimes. Afterwards, copy your trained model back into Colab and resume training from the previous checkpoint.

## Step 1.1 Mount Google Drive
Mount Google Drive so that checkpoints can be saved and loaded later. You will be asked to log in to your Google account.

In [0]:
from google.colab import drive
drive.mount('/content/drive')

## Step 1.2 Verify GPU

Colaboratory now uses an Nvidia T4 GPU, which is slightly faster than the old Nvidia K80 GPU for training GPT-2 and has more memory, allowing you to train the larger GPT-2 models and generate more text. However, the K80 is still assigned sometimes.

You can verify which GPU is active by running the cell below.

In [0]:
!nvidia-smi
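
If you prefer to check the GPU from Python, for example to abort early when none is attached, the following is a minimal sketch. The helper name `list_gpus` is hypothetical; it relies only on the `--query-gpu` and `--format` options of `nvidia-smi` and returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, total memory' string per visible GPU, or [] if none can be queried."""
    # nvidia-smi is only on the PATH when an Nvidia driver is installed.
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


print(list_gpus())
```

An empty list usually means that the hardware accelerator has not been enabled in the notebook settings.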

## Step 1.3 Clone custom repo
Clone the custom Git repository used to train GPT-2 from scratch, then install its requirements.

In [0]:
!git clone https://github.com/zhemann/gpt-2.git

In [0]:
cd gpt-2

In [0]:
!pip3 install -r requirements.txt

## Step 1.4 Download, build and install SentencePiece
See also [sentencepiece](https://github.com/google/sentencepiece)


In [0]:
!pip3 install sentencepiece

In [0]:
cd /content

In [0]:
%%bash -e
if ! [[ -f ./spm_train ]]; then
  wget https://github.com/google/sentencepiece/archive/v0.1.82.zip
  unzip v0.1.82.zip
fi

In [0]:
%cd sentencepiece-0.1.82
%mkdir build
%cd build

In [0]:
!cmake ..

In [0]:
!make -j $(nproc)

In [0]:
!sudo make install

In [0]:
!sudo ldconfig -v

## Step 1.5 Set Python IO Encoding

In [0]:
%env PYTHONIOENCODING=UTF-8

## Step 1.6 Load trained model
Load the trained model. First, create the `models` directory.

In [0]:
cd /content/gpt-2

In [0]:
mkdir models

Then copy the model from Google Drive into the `models` directory.

In [0]:
!cp -r /content/drive/My\ Drive/checkpoint_from_scratch/117MSP /content/gpt-2/models/

# 2. Generating sample texts
To generate samples, please stick to this section.

See section **3. Fine-tuning the model** if you want to fine-tune the model.

## Step 2.1 Generate samples
Generate conditional samples from the model given a prompt you provide. Change the top-k hyperparameter if desired (default is 40).

In [0]:
cd /content/gpt-2

In [0]:
!python3 src/interactive_conditional_samples.py --top_k 40 --temperature 0.5 --length 300 --model_name '117MSP' --nsamples 5
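
To illustrate what `--top_k` and `--temperature` control, here is a minimal pure-Python sketch of top-k sampling with temperature. It is not the repository's code: the function name `sample_top_k` and the plain-list representation of the logits are assumptions made for illustration, and it requires a positive `k`.

```python
import math
import random


def sample_top_k(logits, k=40, temperature=1.0, rng=random):
    """Sample a token index from the k highest-scoring logits."""
    if k <= 0 or temperature <= 0:
        raise ValueError("k and temperature must be positive")
    # Keep the indices of the k largest logits; everything else gets probability 0.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # A temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [logits[i] / temperature for i in top]
    # Softmax over the kept logits, shifted by the maximum for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]
```

With `k=1` the sketch always returns the most likely token; a larger `k` or a higher temperature makes the generated text more varied.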

To check flag descriptions, use:

In [0]:
!python3 src/interactive_conditional_samples.py -- --help

# 3. Fine-tuning the model
This section can be used to fine-tune the current model. Since the `117MSP` directory already contains the encoded datasets, there is no need to copy them separately from Google Drive to the Colaboratory VM; training can start right away.


## Step 3.1 Train model
Start training and save model to Google Drive afterwards.

In [0]:
cd /content/gpt-2

In [0]:
!PYTHONPATH=src ./train.py --dataset models/117MSP/dataset_columns_enc.npz --model_name '117MSP' --steps 5000 --sample_every 1000 --save_every 4000 --learning_rate 2.5e-4 --run_name run1 

In [0]:
!cp -r /content/gpt-2/models/117MSP/ /content/drive/My\ Drive/checkpoint_from_scratch/
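
The same backup could also be scripted in Python, which is convenient if you want to repeat it before the 12-hour reset. This is a sketch only: the helper name `backup_model` is hypothetical, and `dirs_exist_ok` requires Python 3.8 or newer.

```python
import shutil
from pathlib import Path


def backup_model(src_dir, dst_root):
    """Copy the directory src_dir into dst_root, overwriting files from an earlier backup."""
    src = Path(src_dir)
    dst = Path(dst_root) / src.name
    # dirs_exist_ok lets repeated backups reuse the same target directory.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst
```

In this notebook it would be called as `backup_model('/content/gpt-2/models/117MSP', '/content/drive/My Drive/checkpoint_from_scratch')`.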

# 4. Training SentencePiece and encoding the dataset
This section shows how the SentencePiece model (the vocabulary files) is trained, and how the raw datasets are concatenated and encoded.

 

## Step 4.1 Copy dataset
1. Create the `data` directory inside the `gpt-2` directory.
2. Copy one, two or all three datasets from Google Drive into the `data` directory by running the corresponding cells.



In [0]:
cd /content/gpt-2

In [0]:
mkdir data

In [0]:
!cp -r /content/drive/My\ Drive/data_from_scratch/wiki_raw.txt /content/gpt-2/data/

In [0]:
!cp -r /content/drive/My\ Drive/data_from_scratch/columns_raw.txt /content/gpt-2/data/

In [0]:
!cp -r /content/drive/My\ Drive/data_from_scratch/books_raw.txt /content/gpt-2/data/

## Step 4.2 Create dictionary files
In this step, the following actions are performed:
1. Combine all .txt files in the directory gpt-2/data into one large .txt file.
2. Create dictionary files based on the large .txt file.

In [0]:
cd /content/gpt-2

First, we run concat.sh to create one dataset from multiple files and to add custom newline tokens <|n|> to the dataset. This is necessary because SentencePiece does not add such a token to the dictionary automatically.

In [0]:
!sh scripts/concat.sh data datasets_combined.txt
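
The idea behind the newline token can be sketched in Python. This is an illustration only and not the contents of `scripts/concat.sh`: the function names and the single space placed on each side of `<|n|>` are assumptions.

```python
NEWLINE_TOKEN = "<|n|>"


def mark_newlines(text, token=NEWLINE_TOKEN):
    """Replace each line break with an explicit token so it survives tokenization."""
    # splitlines() treats \n, \r\n and \r as line breaks and drops a trailing one.
    return f" {token} ".join(text.splitlines())


def restore_newlines(text, token=NEWLINE_TOKEN):
    """Turn the tokens in generated text back into real line breaks."""
    return text.replace(f" {token} ", "\n")


print(mark_newlines("First line\nSecond line"))
```

Marking line breaks explicitly lets the model learn where paragraphs end, and the generated text can be converted back with `restore_newlines`.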

Then, we copy the generated text file to the `data` directory.

In [0]:
!cp datasets_combined.txt data/

The next cell performs the actual training of the SentencePiece model. This process creates the following three files:


1. hparams.json
2. sp.model
3. sp.vocab



In [0]:
!sh scripts/createspmodel.sh data/datasets_combined.txt 40000
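
The contents of `scripts/createspmodel.sh` are not shown here. As a rough sketch of what a SentencePiece training call with a reserved newline token could look like, the snippet below only builds and prints an `spm_train` command line. The flags are standard `spm_train` options, but the helper name and the choice of flags are assumptions and may differ from what the script actually does.

```python
import shlex


def spm_train_command(corpus, vocab_size, model_prefix="sp", newline_token="<|n|>"):
    """Build an spm_train invocation that keeps the newline token as a single piece."""
    return [
        "spm_train",
        f"--input={corpus}",
        # spm_train writes <prefix>.model and <prefix>.vocab.
        f"--model_prefix={model_prefix}",
        f"--vocab_size={vocab_size}",
        # User-defined symbols are always tokenized as one piece.
        f"--user_defined_symbols={newline_token}",
    ]


cmd = spm_train_command("data/datasets_combined.txt", 40000)
print(" ".join(shlex.quote(part) for part in cmd))
```

The printed line is shell-quoted, so the `<|n|>` token is not interpreted as redirection or a pipe if the command is pasted into a terminal.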

In [0]:
!mkdir -p models/117MSP_Test

In [0]:
cd /content/gpt-2

We created a new directory called `117MSP_Test` in the previous step. Now we copy the three files into this directory.

In [0]:
!cp hparams_117M.json models/117MSP_Test/hparams.json
!cp sp.model models/117MSP_Test/
!cp sp.vocab models/117MSP_Test/

## Step 4.3 Encoding the datasets 
The next cell encodes the raw text file `datasets_combined.txt` with the vocabulary files we just trained for the model `117MSP_Test`.

In [0]:
!sh scripts/encode.sh data/datasets_combined.txt 117MSP_Test dataset_books_columns_enc.npz

Normally, you would then copy the encoded dataset to your Google Drive so that it can be reused.