#  Fine-tuning of a GPT-2 Text-Generating Model on GPU
## using the [_aitextgen_](https://github.com/minimaxir/aitextgen) library (v0.5.2) by Max Woolf
(This Notebook is adapted from an [original](https://colab.research.google.com/drive/15qBZx5y9rdaQSyWpsreMDnTiZ5IlN0zD?usp=sharing) by the same author.)


<b>NOTE</b>: If you want to use this Notebook, `copy it to your Google Drive`, `open it in Google Colaboratory` and `run the cells below`.


In [None]:
#cell 1

!pip uninstall -y tensorboard
!pip install tensorboard==2.3.0
!pip install PyYAML==5.4.1

In [None]:
#cell 2

!pip install -q aitextgen

import logging
logging.basicConfig(
        format="%(asctime)s — %(levelname)s — %(name)s — %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO
    )

from aitextgen import aitextgen
from aitextgen.colab import mount_gdrive, copy_file_from_gdrive

## <b>The advantage of using Google Colaboratory is that it provides a GPU</b> (usually an Nvidia P4, an Nvidia T4, an Nvidia P100, or an Nvidia V100).

To finetune GPT-2 124M for text generation, a T4 or a P100 is the best choice, since they have more VRAM. If you are assigned a T4 or a V100, you can also **enable `fp16=True` during training for faster/more memory-efficient training.**

To verify which GPU is active, run the cell below. To obtain a different GPU, go to **Runtime -> Factory Reset Runtime**.

In [None]:
#cell 3

!nvidia-smi

#NB: Here I assume you chose to use Google Colaboratory. This cell won't work if this Notebook is run outside the Colab VM.
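
The GPU check can also be done from Python, which makes it easy to decide whether to pass `fp16=True` later on. The following is a minimal sketch, assuming that the name reported by `nvidia-smi` contains the model string (for example "Tesla T4"). `detect_gpu_name` and `fp16_recommended` are hypothetical helpers written for this Notebook, not part of _aitextgen_, and the list of GPUs simply mirrors the `fp16` note in the _aitextgen_ documentation quoted in this Notebook.

```python
import shutil
import subprocess

# GPUs for which the aitextgen documentation says fp16 training works.
FP16_GPUS = ("T4", "V100")


def detect_gpu_name():
    """Return the first GPU name reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    names = result.stdout.strip().splitlines()
    return names[0].strip() if result.returncode == 0 and names else None


def fp16_recommended(gpu_name):
    """True if the GPU name matches one of the models listed in FP16_GPUS."""
    return gpu_name is not None and any(tag in gpu_name for tag in FP16_GPUS)


gpu = detect_gpu_name()
print(f"GPU: {gpu!r}, fp16 recommended: {fp16_recommended(gpu)}")
```

Outside Colab, or on a machine without an Nvidia driver, `detect_gpu_name()` returns `None` and the sketch reports that `fp16` is not recommended.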

## Loading the GPT-2

First of all, you need to download the GPT-2 model and load it onto the GPU.

There are several sizes of GPT-2:

* `124M` (default): the "small" model, 500MB on disk.
* `355M`: the "medium" model, 1.5GB on disk.
* `774M`: the "large" model, 3GB on disk.

For the <i>Karl Marx's Press Review</i> project, I used the small model, as after the fine-tuning on Google Colaboratory my laptop's humble CPU can handle that one better and faster for text generation. Users with more powerful computers may be more ambitious!

The next cell downloads the model and saves it in the Colaboratory VM. If the model has already been downloaded, running this cell will reload it.

In [None]:
#cell 4
ai = aitextgen(tf_gpt2="124M", to_gpu=True)

## Mounting Google Drive

As the author of _aitextgen_ suggests, “the best way to get input text to-be-trained into the Colaboratory VM, and to get the trained model *out* of Colaboratory, is to route it through Google Drive *first*”.

Here I again assume you chose to use Colab, as this cell will only work in that environment. This code can be used to **mount your personal Google Drive in the VM**, which later cells can use to get data in/out. 

**NB:** You will be asked for an authorisation code! 

In [None]:
#cell 4b

mount_gdrive()

## Uploading a .txt file containing the training dataset from your local machine to Colaboratory.
#### (Also works with single-column CSV files) 



In the Colaboratory Notebook sidebar on the left of the screen, select *Files*. From there you can upload files:

![alt text](https://i.imgur.com/w3wvHhR.png)

### OPTION 1:
If your text file is **small**, upload it to Colaboratory directly from your local machine, update the file name in the cell below, then run the cell.

**NB**: Keep in mind that GPT-2 has a limit of 1024 tokens (not characters) per text sample, so make sure you clean your dataset accordingly. If you are repeating my exercise on the Marx-Engels dataset, this is the time to run the Python module `scraper_preprocesser.py`, found in the folder `data_and_model` (check the requirements first). It produces a `marx.txt` file that is already optimised for fine-tuning.
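
Over-long samples are easier to deal with if they are flagged before training. The sketch below shows one way to do that; it is an illustration, not part of `scraper_preprocesser.py` or _aitextgen_. `find_long_samples` and `whitespace_count` are hypothetical helpers, and `whitespace_count` is only a rough stand-in that counts words rather than real GPT-2 tokens.

```python
def find_long_samples(samples, count_tokens, limit=1024):
    """Return (index, token_count) for every sample longer than `limit` tokens."""
    too_long = []
    for index, text in enumerate(samples):
        n_tokens = count_tokens(text)
        if n_tokens > limit:
            too_long.append((index, n_tokens))
    return too_long


def whitespace_count(text):
    """Rough stand-in: counts whitespace-separated words, not real GPT-2 tokens."""
    return len(text.split())


samples = ["A spectre is haunting Europe.", "word " * 2000]
print(find_long_samples(samples, whitespace_count))
```

For real counts, pass a function based on the GPT-2 tokenizer instead, for example one built on `GPT2TokenizerFast.from_pretrained("gpt2")` from the `transformers` library. A text usually splits into more tokens than words, so the whitespace count tends to underestimate the true length.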

In [None]:
#cell 5 

file_name = "marx.txt"

In [None]:
#cell 6

# Mount Google Drive at /content/drive using Colab's own API (only works inside Colab)
from google.colab import drive
drive.mount('/content/drive')

### OPTION 2: 
If your text file is **large (>10MB)**, do not upload it from your machine! Instead, **upload that file to Google Drive first**, then copy it from Google Drive to the Colaboratory VM.

(As an alternative, _aitextgen_ also implements a function to [compress your dataset to a cache](https://docs.aitextgen.io/dataset/) on your local computer. You can then upload the resulting `dataset_cache.tar.gz` to Colab and set the `file_name` in `cell 5` to that.)

In [None]:
#cell 7

copy_file_from_gdrive(file_name)

## Finetune the GPT-2

As Max Woolf explains in his original Notebook, the next cell will “start the actual finetuning of GPT-2 in _aitextgen_. It runs for `num_steps`, and a progress bar will appear to show training progress, current loss (the lower the better the model), and average loss (to give a sense on loss trajectory). The model will be saved every `save_every` steps in `trained_model` by default, and when training completes.”

**NB**: If you mounted your Google Drive, the model will _also_ be saved there in a unique folder. You have mounted it, haven't you? The process may be long, or may time out after some hours, and _you definitely don't want to lose your progress_.

#### **Below I copy and paste the documentation of _aitextgen_ to set the `train()` parameters for GPT2:**

- **`line_by_line`**: Set this to `True` if the input text file is a single-column CSV, with one record per row. _aitextgen_ will automatically process it optimally.
- **`from_cache`**: If you compressed your dataset locally (as noted in the previous section) and are using that cache file, set this to `True`.
- **`num_steps`**: Number of steps to train the model for.
- **`generate_every`**: Interval of steps to generate example text from the model; good for qualitatively validating training.
- **`save_every`**: Interval of steps to save the model: the model will be saved in the VM to `/trained_model`.
- **`save_gdrive`**: Set this to `True` to copy the model to a unique folder in your Google Drive, if you have mounted it in the earlier cells.
- **`fp16`**: Enables half-precision training for faster/more memory-efficient training. **Only works on a T4 or V100 GPU.**

Here are other important parameters for `train()` that are useful but you likely do not need to change.

- **`learning_rate`**: Learning rate of the model training.
- **`batch_size`**: Batch size of the model training; setting it too high will cause the GPU to go OOM. (if using `fp16`, you can increase the batch size more safely)
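
To get a feel for how `num_steps`, `generate_every` and `save_every` interact, the schedule can be worked out by hand. This is a minimal sketch, assuming that a sample is generated (or a checkpoint saved) whenever the step count is a multiple of the interval; that is an assumption about the training loop, not something taken from the _aitextgen_ source. `scheduled_steps` is a hypothetical helper, and the values are the ones used in the training cell of this Notebook.

```python
def scheduled_steps(num_steps, every):
    """Steps in 1..num_steps that are multiples of `every` (empty if every <= 0)."""
    if every <= 0:
        return []
    return list(range(every, num_steps + 1, every))


num_steps, generate_every, save_every = 3000, 1000, 1000
print("samples generated at steps:", scheduled_steps(num_steps, generate_every))
print("checkpoints saved at steps:", scheduled_steps(num_steps, save_every))
```

A smaller `save_every` loses less work if the Colab session times out, at the cost of more time spent writing the model to disk.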

In [None]:
#cell 8

ai.train(file_name,
         line_by_line=False,
         from_cache=False,
         num_steps=3000,
         generate_every=1000,
         save_every=1000,
         save_gdrive=True,
         learning_rate=1e-3,
         fp16=True,
         batch_size=1, 
         )

## Finetuning complete!

When the process is finished, you will find a new `trained_model` folder in the Colaboratory files to the **left** of your screen (the same place where you uploaded the training dataset).

The folder is quite heavy and contains two files (`pytorch_model.bin` and `config.json`). To use it within the _Karl Marx's Press Review_ project, **download its content and copy it** <u>as it is</u> into the <u>two subfolders</u> `trained_model` that may be found in:
- `marxist_press_review/article_collector/` and 
- `marxist_press_review/press_review_app/`

This will allow the modules contained in these folders to run.
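
If you prefer to script the copy on your local machine rather than doing it by hand, something like the following could work. It is a sketch under stated assumptions: `deploy_model` is a hypothetical helper, not part of the project, and the demonstration runs in a temporary directory with empty placeholder files rather than a real model.

```python
import shutil
import tempfile
from pathlib import Path

# The two files this Notebook says the trained_model folder contains.
REQUIRED_FILES = ("pytorch_model.bin", "config.json")


def deploy_model(source, destinations):
    """Copy the contents of `source` into each destination after checking the files."""
    source = Path(source)
    missing = [name for name in REQUIRED_FILES if not (source / name).is_file()]
    if missing:
        raise FileNotFoundError(f"{source} is missing: {', '.join(missing)}")
    for destination in destinations:
        # dirs_exist_ok (Python 3.8+) allows copying into an existing folder.
        shutil.copytree(source, destination, dirs_exist_ok=True)


# Demonstration in a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    model_dir = root / "trained_model"
    model_dir.mkdir()
    for name in REQUIRED_FILES:
        (model_dir / name).touch()
    targets = [
        root / "marxist_press_review" / "article_collector" / "trained_model",
        root / "marxist_press_review" / "press_review_app" / "trained_model",
    ]
    deploy_model(model_dir, targets)
    copied = [sorted(p.name for p in target.iterdir()) for target in targets]
    print(copied)
```

In real use, `source` would be the folder you downloaded from Colaboratory and `targets` the two `trained_model` subfolders of your local copy of the project.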

But before doing that, and in case you are using this code to finetune GPT-2 on **your own dataset**, I strongly advise you to run the following cells to test what you have done. <u>Bad results may lead you to rework your dataset</u>, or to change the parameters in `cell 8`.


The next cell will allow you to load the retrained model + metadata necessary to generate text.

In [None]:
#cell 9

ai = aitextgen(model_folder="trained_model", to_gpu=True)

## Generate Text From The Trained Model

After you've trained the model or loaded a retrained model from checkpoint, you can now generate text.

**If you just trained a model**, you'll get much faster generation performance if you reload the model; the next cell will reload the model you just trained from the `trained_model` folder.

In [None]:
#cell 10

ai = aitextgen(model_folder="trained_model", to_gpu=True)

Check out the [_aitextgen_ documentation](https://docs.aitextgen.io/generate/) for the available options when generating text. Here, purely for testing purposes, I have chosen to generate multiple texts at a time (this is done by specifying the **`n`** parameter). You can pass a **`batch_size`** to generate multiple samples in parallel (this parameter can be set up to 50 in Colaboratory). If provided, the **`prompt`** parameter forces the generated text to begin with a given string.

Other optional-but-helpful parameters for `ai.generate()`, taken from the _aitextgen_ documentation:

*  **`min_length`**: The minimum length of the generated text: if the text is shorter than this value after cleanup, aitextgen will generate another one.
*  **`max_length`**: Number of tokens to generate (default 256, you can generate up to 1024 tokens with GPT-2 and 2048 with GPT Neo)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
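
To make `temperature`, `top_k` and `top_p` more concrete, here is a simplified, pure-Python illustration of how these three settings reshape the probabilities of the next token. It is a teaching sketch, not the implementation used by _aitextgen_ or the underlying `transformers` library; `sampling_distribution` is a hypothetical function and the logits are made-up numbers.

```python
import math


def sampling_distribution(logits, temperature=0.7, top_k=0, top_p=1.0):
    """Return next-token probabilities after temperature, top-k and top-p filtering."""
    # Temperature: dividing the logits by a value below 1 sharpens the distribution.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    total = sum(weights)
    probs = [weight / total for weight in weights]

    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top_k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index] / mass
        if cumulative >= top_p:
            break

    # Renormalise the surviving tokens; every other token gets probability 0.
    kept_total = sum(probs[i] for i in kept)
    return [probs[i] / kept_total if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four candidate tokens
for settings in ({"temperature": 1.0}, {"temperature": 0.7},
                 {"temperature": 1.0, "top_k": 2}, {"temperature": 1.0, "top_p": 0.9}):
    rounded = [round(p, 3) for p in sampling_distribution(logits, **settings)]
    print(settings, rounded)
```

Lower temperatures and tighter `top_k`/`top_p` values concentrate the probability on the most likely tokens, which is why the generated text becomes less "crazy".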

In [None]:
#cell 11

ai.generate(n=5,
            batch_size=3,
            prompt="Wages are growing",
            min_length=10,
            max_length=100,
            temperature=0.7,
            top_p=0.9
            )

## **AND THAT'S IT!** 
Below I paste the licence included in the original Notebook by Max Woolf from which I have adapted my code.


---

#### **MIT License**

Copyright (c) 2020-2021 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.