<a href="https://colab.research.google.com/github/rawkintrevo/caikit-nlp/blob/no-jira-small-finetune-notebook/examples/FIne_Tuning_GPT2_md_model_with_caikit_nlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 1: Install Caikit

## Installation and Setup

In this example Jupyter notebook, we'll be using various Python libraries and pre-trained models for evaluating and analyzing natural language processing tasks. Before we proceed, we need to install the required dependencies and download some essential resources.

### 1. Installing Libraries

To begin, we'll install the following Python packages using `pip`:

- `evaluate`: A library for evaluating model performance on different NLP tasks.
- `rouge_score`: A package for calculating ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics for text summarization.

Please note that these libraries may have dependencies, so we'll ensure all the necessary requirements are met during the installation process.

```python
!pip install evaluate
!pip install rouge_score
```
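To make the ROUGE idea concrete, here is a minimal sketch of a ROUGE-1 style score (unigram overlap) in plain Python. It is a simplified illustration, not the implementation used by `rouge_score` or `evaluate`: it assumes whitespace tokenization and lowercasing only, and the function name `rouge1` is hypothetical.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall, and F1 (a simplified ROUGE-1)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a word counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy strings for illustration only (not taken from the billsum dataset).
scores = rouge1("the bill amends the vehicle code", "the bill amends the tax code")
print(scores)
```

Recall measures how much of the reference summary the candidate covers, which is why ROUGE is described as recall-oriented.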

### 2. Installing `caikit` and `caikit-nlp`

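Next, we'll install `caikit` and `caikit-nlp` directly from GitHub. The project is still in beta and breaking changes can happen, so `caikit` is pinned to the `v0.11.3` tag. Note that the command below installs `caikit-nlp` from its default branch, without a pin.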

```python
!pip install git+https://github.com/caikit/caikit@v0.11.3
!pip install git+https://github.com/caikit/caikit-nlp
```
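Because breaking changes are possible, it can help to record which versions actually got installed. The following is a minimal sketch using the standard library's `importlib.metadata`. The helper name `installed_version` is hypothetical, and the distribution names in the loop are assumptions based on the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names are assumed to match the pip install commands above.
for name in ("caikit", "caikit-nlp", "evaluate", "rouge_score"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```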

### 3. Downloading Additional Resources

To get the example scripts used later in this notebook (under `caikit-nlp/examples`), we'll also clone the `caikit-nlp` repository.


```python
!git clone https://github.com/caikit/caikit-nlp
```

With the libraries installed and the repository cloned, we can move on to fine-tuning.

In [None]:
!pip install evaluate
!pip install rouge_score

!pip install git+https://github.com/caikit/caikit@v0.11.3
!pip install git+https://github.com/caikit/caikit-nlp

!git clone https://github.com/caikit/caikit-nlp

Collecting evaluate
  Downloading evaluate-0.4.0-py3-none-any.whl (81 kB)
Collecting datasets>=2.0.0 (from evaluate)
  Downloading datasets-2.14.4-py3-none-any.whl (519 kB)
Collecting dill (from evaluate)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting xxhash (from evaluate)
  Downloading xxhash-3.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (194

# Step 2: Fine-Tuning

```
!python caikit-nlp/examples/run_fine_tuning.py --dataset "billsum" \
  --model_name gpt2-medium \
  --num_epochs 10 \
  --output_dir tmp/gpt2-md \
  --batch_size=8 \
  --accumulate_steps 32 \
  --max_source_length 512 \
  --metric rouge \
  --torch_dtype bfloat16 \
  --evaluate
```

This command runs the Python script `run_fine_tuning.py` with the Python interpreter. The script ships in the `examples` directory of the `caikit-nlp` repository and fine-tunes a pre-trained model (in this case `gpt2-medium`) on a specific dataset (in this case `billsum`).

Let's explain each argument in the command:

1. `caikit-nlp/examples/run_fine_tuning.py`: The path to the Python script that will be executed. It lives in the `examples` directory of the caikit-nlp repository and drives the fine-tuning run.
1. `--dataset "billsum"`: The dataset to use for tuning. In this example it is `billsum`, a dataset of US Congressional and California state bills paired with summaries ([link](https://huggingface.co/datasets/billsum)).
1. `--model_name gpt2-medium`: The base model to fine-tune. In this case it's `gpt2-medium`, the GPT-2 Medium model from Hugging Face ([link](https://huggingface.co/gpt2-medium)).
1. `--num_epochs 10`: The number of epochs (full passes over the training data). Here it's set to 10, meaning the model will go through the dataset ten times during fine-tuning.
1. `--output_dir tmp/gpt2-md`: The directory where the fine-tuned model and related outputs will be stored. In this case it's the `tmp/gpt2-md` directory.
1. `--batch_size=8`: The batch size used during training. The data will be divided into batches of 8 samples each.
1. `--accumulate_steps 32`: The number of batches over which gradients are accumulated before the weights are updated. This simulates a larger batch size (here 8 × 32 = 256 samples per update) when GPU memory is limited.
1. `--max_source_length 512`: The maximum length of the input sequence.
1. `--metric rouge`: Sets the evaluation metric to ROUGE.
1. `--torch_dtype bfloat16`: The dtype to use for training. `float32` is considered 'full precision'; `float16` and `bfloat16` are both 16-bit (half-precision) formats. `bfloat16` has a wider range but less precision than `float16`, and on NVIDIA hardware it is natively supported only on Ampere-class or newer GPUs (such as the A100).
1. `--evaluate`: Signals the script to evaluate the model at the end of fine-tuning.
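The range-versus-precision trade-off between `float16` and `bfloat16` can be seen without a GPU. The sketch below emulates both formats with the standard library's `struct` module: `bfloat16` is treated as the top 16 bits of a `float32` (8 exponent bits, 7 mantissa bits), while `float16` uses Python's built-in half-precision packing (5 exponent bits, 10 mantissa bits). The helper names are hypothetical, and this only illustrates the number formats, not how PyTorch implements them.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value (NaN/inf edge cases are ignored)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    # bfloat16 keeps the top 16 bits of a float32; round to nearest, ties to even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def to_float16(x: float) -> float:
    """Round x to the nearest float16 value; raises OverflowError if x is too large."""
    return struct.unpack(">e", struct.pack(">e", x))[0]


# Precision: float16 keeps more mantissa bits, so it lands closer to 1.001.
print(to_float16(1.001), to_bfloat16(1.001))

# Range: 70000.0 is representable (coarsely) in bfloat16 but overflows float16.
print(to_bfloat16(70000.0))
try:
    to_float16(70000.0)
except OverflowError:
    print("70000.0 does not fit in float16")
```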


Overall, this command fine-tunes the `gpt2-medium` model on the `billsum` dataset with specific settings for maximum source length, batch size, accumulation steps, and so on. For a full list of available arguments, their descriptions, and default values, run `!python caikit-nlp/examples/run_fine_tuning.py --help`.

The results of fine-tuning will be stored in the `tmp/gpt2-md` directory.
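As a quick sanity check on the batch size and accumulation settings, the arithmetic can be sketched as follows. The function names and the dataset size of 10,000 examples are hypothetical and used only for illustration. Whether a trailing partial accumulation window triggers a weight update depends on the training loop, so the step count here is an assumption.

```python
import math


def effective_batch_size(batch_size: int, accumulate_steps: int) -> int:
    """Number of samples contributing to each weight update."""
    return batch_size * accumulate_steps


def optimizer_steps_per_epoch(num_examples: int, batch_size: int, accumulate_steps: int) -> int:
    """Weight updates per epoch, assuming a final partial window still updates."""
    return math.ceil(num_examples / effective_batch_size(batch_size, accumulate_steps))


# Settings from the command above: --batch_size=8, --accumulate_steps 32
print(effective_batch_size(8, 32))

# Hypothetical dataset size, for illustration only.
print(optimizer_steps_per_epoch(10_000, 8, 32))
```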

# A 'Small' Model

The first model we'll fine-tune is `gpt2-medium` ([link](https://huggingface.co/gpt2-medium)). This model has 355M parameters. We'll fine-tune it on the `billsum` dataset and then see how it performs on summarizing new bills.


In [1]:
%env ALLOW_DOWNLOADS=true
!python caikit-nlp/examples/run_fine_tuning.py --dataset "billsum" \
  --model_name gpt2-medium \
  --num_epochs 10 \
  --output_dir tmp/gpt2-md \
  --batch_size=8 \
  --accumulate_steps 32 \
  --max_source_length 512 \
  --metric rouge \
  --torch_dtype bfloat16 \
  --evaluate

env: ALLOW_DOWNLOADS=true
python3: can't open file '/content/caikit-nlp/examples/run_fine_tuning.py': [Errno 2] No such file or directory


## GPU Memory footprint: 27.1GB

## Run Time: ~13 hours (for 10 epochs... probably way overkill).

## Next: Persist our model to Google Drive



In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
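# Copy the model to Google Drive for later usage/retrieval
# (create the target folder first so the copy does not fail if it is missing)
!mkdir -p /content/gdrive/MyDrive/LLMs
!cp -r tmp/gpt2-md /content/gdrive/MyDrive/LLMs/gpt2-md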
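The same copy can be done from Python, which makes it easier to fail early if the model directory is missing. This is a minimal sketch using only the standard library. The helper name `persist_model` is hypothetical, and the example call assumes Drive is already mounted at `/content/gdrive`.

```python
import shutil
from pathlib import Path


def persist_model(src: str, dst: str) -> Path:
    """Copy a model directory to dst, creating parent folders as needed."""
    src_path, dst_path = Path(src), Path(dst)
    if not src_path.is_dir():
        raise FileNotFoundError(f"model directory not found: {src_path}")
    dst_path.parent.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok=True lets a re-run overwrite an earlier copy (Python 3.8+).
    shutil.copytree(src_path, dst_path, dirs_exist_ok=True)
    return dst_path


# Example call with the paths used in this notebook (requires Drive to be mounted):
# persist_model("tmp/gpt2-md", "/content/gdrive/MyDrive/LLMs/gpt2-md")
```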

## Playing with the model

What's the point of training LLMs if you aren't going to play with them?

We'll _try_ to use our model to summarize a bill from the Illinois House that makes it explicitly **legal** for bicyclists to treat stop signs as yield signs, so long as:

1. No other traffic that would cause a hazard is present.
2. They slow down enough to actually look for a hazard.
3. It's not a railroad crossing.

Issues we can anticipate ahead of time:

GPT-2 was trained to take a prompt and continue it; it was not trained to summarize things. This model is also at least two orders of magnitude smaller than a modern LLM like GPT-3.5, so let's not get our hopes too high.

### The actual synopsis of the bill

> Amends the Illinois Vehicle Code. Defines "immediate hazard". Provides instances in which an individual operating a bicycle approaching a stop sign may proceed through the intersection without stopping at the stop sign.

In [None]:
from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='tmp/gpt2-md/artifacts')

set_seed(42)



In [None]:
text = """summarize:
  Be it enacted by the People of the State of Illinois,
represented in the General Assembly:

Section 5. The Illinois Vehicle Code is amended by adding
Section 11-1511.5 as follows:

(625 ILCS 5/11-1511.5 new)
Sec. 11-1511.5. Operation of bicycle approaching a stop
sign.
		(a) As used in this Section, "immediate hazard" means a
vehicle approaching an intersection at a proximity and rate of
speed sufficient to indicate to a reasonable person that there
is a danger of collision or accident.
    (b) Except as provided in subsection (c), an individual
operating a bicycle approaching a stop sign may proceed
through the intersection without stopping at the stop sign if:
        (1) the individual slows to a reasonable speed; and
        (2) the individual yields the right-of-way to:
            (i) any pedestrian within the intersection or an
        adjacent crosswalk;
            (ii) other traffic within the intersection; and
            (iii) oncoming traffic that poses an immediate
        hazard during the time the individual is traveling
        through the intersection.




HB3923	- 2 -	LRB103 26384 MXP 52747 b
    (c) Subsection (b) does not apply to an intersection with
an active railroad grade crossing.
"""

resp = generator(text, max_length=756, num_return_sequences=3)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
print(resp[0]['generated_text'].replace(text,''))

func>The State Motor Vehicle
Vehicle Code governs the operation of vehicles by every
regarding a stop sign.
func>
func>4. Except as provided in subsection (d), an individual
operating a bicycle approaching a stop sign may proceed
through the intersection without stopping at the stop sign
if:
func>
func> 4. (d) There is not traffic within the intersection or an     adjacent crosswalk
that poses an immediate   the intersection is open.
func>a. (iii) 会(l) †The Stop, oncoming Traffic oncoming Traffic
is a pedestrian within the intersection, whether or not
is within the intersection is in
the path of another vehicle approaching the stop sign by a vehicle that requires
the stop sign position is in a manner that may cause the
individual to yield the right of way to:
func>a. (ii) other traffic within the intersection. (iii) The intersection is open.
Functionality
func>An intersection is an intersection with an active railroad grade crossing, which includes
the railroad grade crossing adjacent 
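One caveat about the `print` cell above: `.replace(text, '')` removes every occurrence of the prompt, including any the model happens to repeat in its continuation. A minimal sketch that strips only the leading copy is shown below. The helper name `strip_prompt` and the sample strings are hypothetical. The text-generation pipeline also accepts `return_full_text=False` to return only the continuation, assuming the installed `transformers` version supports that argument.

```python
def strip_prompt(generated: str, prompt: str) -> str:
    """Remove the prompt only when it appears at the start of the generated text."""
    if generated.startswith(prompt):
        return generated[len(prompt):]
    return generated


# Toy strings for illustration only (not real model output).
sample_prompt = "summarize: the bill text"
sample_output = sample_prompt + " summarize: the bill text, continued"
print(strip_prompt(sample_output, sample_prompt))
```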

## Conclusion

It doesn't do horribly. There are some strange characters and formatting, and we see that it wants to start with `4. (d)` because that 'continues' the prompt, but the output is not wildly off base.