# Finetuning LLMs

In this notebook, we will use Anyscale's LLMForge to finetune our first LLM.

<div class="alert alert-block alert-info">
<b> Here is the roadmap for this notebook:</b>
<ul>
    <li><b>Part 0:</b> Why finetune LLMs?</li>
    <li><b>Part 1:</b> Introduction to LLMForge</li>
    <li><b>Part 2:</b> Submitting an LLM Finetuning Job</li>
    <li><b>Part 3:</b> Tracking the Progress of the Job</li>
    <li><b>Part 4:</b> Tailoring LLMForge to Your Needs</li>
</ul>
</div>

## Imports

In [None]:
import anyscale

## 0. Why finetune LLMs?

The main use case for finetuning LLMs is to adapt a pre-trained model to a specific task or dataset.

- **Task-Specific Performance**: Fine-tuning hones an LLM's capabilities for a particular task, leading to superior performance.
- **Resource Efficiency**: We can use smaller LLMs that require less computational resources to achieve better performance than larger general-purpose models.
- **Privacy and Security**: We can self-host finetuned models to ensure that our data is not shared with third parties.

In this guide, we will finetune an LLM on a custom video gaming dataset.

The task is a functional representation task where we want to extract structured data from user input on video games.

## 1. Introduction to LLMForge

<!-- get one liner from docs -->
Anyscale's [LLMForge](https://docs.anyscale.com/llms/finetuning/intro/#what-is-llmforge) provides an easy-to-use library for fine-tuning LLMs.

<!-- add diagram on how to work with LLMForge -->
Here is a diagram that shows a *typical workflow* when working with LLMForge:


<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/e2e-llms/llmforge-finetune-workflow-v3.png" width=800>

### Preparing an LLMForge configuration file

We have already prepared a configuration file for you at `configs/training/lora/mistral-7b.yaml`.

Here are the file contents:

```yaml
# Change this to the model you want to fine-tune
model_id: mistralai/Mistral-7B-Instruct-v0.1

# Change this to the path to your training data
train_path: s3://anyscale-public-materials/llm-finetuning/viggo_inverted/train/subset-500.jsonl

# Change this to the path to your validation data. This is optional
valid_path: s3://anyscale-public-materials/llm-finetuning/viggo_inverted/valid/data.jsonl

# Change this to the context length you want to use. Examples with longer
# context length will be truncated.
context_length: 512

# Change this to total number of GPUs that you want to use
num_devices: 2

# Change this to the number of epochs that you want to train for
num_epochs: 3

# Change this to the batch size that you want to use
train_batch_size_per_device: 16
eval_batch_size_per_device: 16

# Change this to the learning rate that you want to use
learning_rate: 1e-4

# This will pad batches to the longest sequence. Use "max_length" when profiling to profile the worst case.
padding: "longest"

# By default, we will keep the best checkpoint. You can change this to keep more checkpoints.
num_checkpoints_to_keep: 1

# Deepspeed configuration, you can provide your own deepspeed setup
deepspeed:
  config_path: configs/deepspeed/zero_3_offload_optim+param.json

# Lora configuration
lora_config:
  r: 8
  lora_alpha: 16
  lora_dropout: 0.05
  target_modules:
    - q_proj
    - v_proj
    - k_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
    - embed_tokens
    - lm_head
  task_type: "CAUSAL_LM"
  bias: "none"
  modules_to_save: []

```

<!-- LLMForge config explained -->
The LLMForge finetuning config can be split into the following groups:

- **Model Configuration:**
    - `model_id`: The Hugging Face model name.
    
- **Data Configuration:**
    - `train_path`: The path to the training data.
    - `valid_path`: The path to the validation data.
    - `context_length`: The maximum number of tokens in the input.
    
- **Training Configuration:**
    - `learning_rate`: The learning rate for the optimizer.
    - `num_epochs`: The number of epochs to train for.
    - `train_batch_size_per_device`: The batch size per device for training.
    - `eval_batch_size_per_device`: The evaluation batch size per device.
    - `num_devices`: The number of devices to train on.
    
- **Output Configuration:**
    - `num_checkpoints_to_keep`: The number of checkpoints to retain.
    - `output_dir`: The output directory for the model outputs.

- **Advanced Training Configuration:**
    - **LoRA Configuration:**
        - `lora_config`: The LoRA configuration. Key parameters include:
            - `r`: The rank of the LoRA matrix.
            - `target_modules`: The modules to which LoRA will be applied.
            - `lora_alpha`: The LoRA alpha parameter (a scaling factor).
    - **DeepSpeed Configuration:**
        - `deepspeed`: Settings for distributed training strategies such as DeepSpeed ZeRO (Zero Redundancy Optimizer).
            - This may include specifying the ZeRO stage (to control what objects are sharded/split across GPUs).
            - Optionally, enable CPU offloading for parameter and optimizer states.
    

Default configurations for all popular models are available in the `llm-forge` library, which serve as a good starting point for most tasks.
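
To make the `r` and `lora_alpha` settings from the LoRA configuration more concrete, here is a minimal arithmetic sketch. It is not part of the LLMForge API: the helper `lora_trainable_params` is hypothetical, and the 4096 x 4096 matrix shape is an assumption standing in for a square attention projection such as `q_proj` at Mistral-7B's hidden size. It relies on the standard LoRA formulation, where a frozen weight matrix is augmented with two low-rank factors whose product is scaled by `lora_alpha / r`.

```python
# Illustrative arithmetic only; this does not call LLMForge.

def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the two low-rank factors A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Values taken from the config above.
r = 8
lora_alpha = 16

# Assumption: a square 4096 x 4096 projection matrix.
d_in = d_out = 4096

full = d_in * d_out
adapter = lora_trainable_params(d_in, d_out, r)
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"full weight matrix: {full:,} parameters")
print(f"LoRA adapter:       {adapter:,} parameters ({adapter / full:.2%} of full)")
print(f"update scaling:     {scaling}")
```

Under these assumptions, the adapter for one such matrix trains well under 1% of the parameters of the matrix it adapts. Raising `r` grows the adapter linearly.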


## 2. Submitting an LLM Finetuning Job

To run the finetuning, we will be using the Anyscale Job SDK.

We start by defining a JobConfig object with the following content:

In [None]:
job_config = anyscale.job.JobConfig(
    # The command to run the finetuning process
    entrypoint="llmforge anyscale finetune configs/training/lora/mistral-7b.yaml",
    # The image to use for the job
    image_uri="localhost:5555/anyscale/llm-forge:0.5.4",
    # Retry the job at most once
    max_retries=1
)

We can then run the following command to submit the job:

In [None]:
job_id = anyscale.job.submit(config=job_config)
job_id

<div class="alert alert-warning">

<b>Note:</b> by default the job will make use of the same compute configuration as the current workspace that is submitting the job unless specified otherwise.

</div>

## 3. Tracking the Progress of the Job

Once the job is submitted, we can make use of the observability features of the Anyscale platform to track the progress of the job at the following location: https://console.anyscale.com/jobs/{job_id}

More specifically, we can inspect the following:
- Logs to view which stage of the finetuning process the job is currently at.
- Hardware utilization metrics to ensure that the job is making full use of the resources allocated to it.
- Training metrics to see how the model is performing on the validation set.

If you would like to follow the job logs in real-time, you can run the following command:

```bash
!anyscale job logs --id {job_id} -f
```

If you head to the Job's dashboard, you can see the hardware utilization metrics showcasing the GPU utilization and the memory usage:

<img src="https://anyscale-public-materials.s3.us-west-2.amazonaws.com/ray-summit/e2e-llms/hardware-utilization-metrics-v2.jpg" width=800>


Under the job's log tab, you can see a snippet of the logs showcasing the training metrics:

```
2024-09-04, 17:36:21.824	driver	╭───────────────────────────────────────────────╮
2024-09-04, 17:36:21.824	driver	│ Training result                               │
2024-09-04, 17:36:21.824	driver	├───────────────────────────────────────────────┤
2024-09-04, 17:36:21.824	driver	│ checkpoint_dir_name                           │
2024-09-04, 17:36:21.824	driver	│ time_this_iter_s                      9.07254 │
2024-09-04, 17:36:21.824	driver	│ time_total_s                          414.102 │
2024-09-04, 17:36:21.824	driver	│ training_iteration                         29 │
2024-09-04, 17:36:21.824	driver	│ avg_bwd_time_per_epoch                        │
2024-09-04, 17:36:21.824	driver	│ avg_fwd_time_per_epoch                        │
2024-09-04, 17:36:21.824	driver	│ avg_train_loss_epoch                          │
2024-09-04, 17:36:21.824	driver	│ bwd_time                              5.13469 │
2024-09-04, 17:36:21.824	driver	│ epoch                                       1 │
2024-09-04, 17:36:21.824	driver	│ eval_loss                                     │
2024-09-04, 17:36:21.824	driver	│ eval_time_per_epoch                           │
2024-09-04, 17:36:21.824	driver	│ fwd_time                              3.94241 │
2024-09-04, 17:36:21.824	driver	│ learning_rate                           5e-05 │
2024-09-04, 17:36:21.824	driver	│ num_iterations                             13 │
2024-09-04, 17:36:21.824	driver	│ perplexity                                    │
2024-09-04, 17:36:21.824	driver	│ step                                       12 │
2024-09-04, 17:36:21.824	driver	│ total_trained_steps                        29 │
2024-09-04, 17:36:21.824	driver	│ total_update_time                     268.125 │
2024-09-04, 17:36:21.824	driver	│ train_loss_batch                      0.28994 │
2024-09-04, 17:36:21.824	driver	│ train_time_per_epoch                          │
2024-09-04, 17:36:21.824	driver	│ train_time_per_step                   9.07861 │
2024-09-04, 17:36:21.824	driver	│ trained_tokens                         280128 │
2024-09-04, 17:36:21.824	driver	│ trained_tokens_this_iter                10752 │
2024-09-04, 17:36:21.824	driver	│ trained_tokens_throughput             1044.76 │
2024-09-04, 17:36:21.824	driver	│ trained_tokens_throughput_this_iter   1184.51 │
2024-09-04, 17:36:21.824	driver	╰───────────────────────────────────────────────╯
2024-09-04, 17:36:21.824	driver	(RayTrainWorker pid=2484, ip=10.0.32.0) [epoch 1 step 12] loss: 0.28619903326034546 step-time: 9.077147483825684
```
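
If you want to chart these values yourself, the per-step worker line at the end of the snippet is easy to parse. The sketch below is a hypothetical helper, not an Anyscale or LLMForge utility, and it assumes the line format stays exactly as shown in this snippet, which may differ between versions. It also divides `trained_tokens_this_iter` from the table by the parsed step time to get a rough tokens-per-second figure.

```python
import re

# Worker line copied from the log snippet above.
LOG_LINE = (
    "(RayTrainWorker pid=2484, ip=10.0.32.0) [epoch 1 step 12] "
    "loss: 0.28619903326034546 step-time: 9.077147483825684"
)

# Assumption: the format "[epoch E step S] loss: X step-time: Y" is stable.
STEP_PATTERN = re.compile(
    r"\[epoch (?P<epoch>\d+) step (?P<step>\d+)\] "
    r"loss: (?P<loss>[0-9.eE+-]+) step-time: (?P<step_time>[0-9.eE+-]+)"
)


def parse_step_line(line: str):
    """Return epoch, step, loss and step time from a worker log line, or None."""
    match = STEP_PATTERN.search(line)
    if match is None:
        return None
    return {
        "epoch": int(match["epoch"]),
        "step": int(match["step"]),
        "loss": float(match["loss"]),
        "step_time": float(match["step_time"]),
    }


record = parse_step_line(LOG_LINE)
tokens_this_iter = 10752  # trained_tokens_this_iter from the table above
tokens_per_second = tokens_this_iter / record["step_time"]

print(record)
print(f"approx. tokens/s for this step: {tokens_per_second:.2f}")
```

For this snippet, the derived figure lands close to the reported `trained_tokens_throughput_this_iter`, which suggests the metric is computed this way, although that is an inference from one log sample rather than documented behavior.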


Note that you can also use tools like TensorBoard to visualize the training metrics.

## 4. Tailoring LLMForge to Your Needs

### 1. Start with a default configuration

Use the Anyscale [finetuning LLMs template](https://console.anyscale.com/v2/template-preview/finetuning_llms_v2), which contains default configurations for the most common models.

### 2. Customize to point to your data

Use the `train_path` and `valid_path` to point to your data. Update the `context_length` to fit your expected sequence length.
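
One way to pick a `context_length` is to look at the distribution of token counts in your dataset and choose a value that covers most examples. The sketch below is a hypothetical helper, not part of LLMForge. It assumes you have already tokenized your examples with the model's tokenizer and collected the per-example token counts; the numbers in `lengths` are made up for illustration.

```python
import math


def suggest_context_length(token_lengths, coverage=0.95, multiple=64):
    """Smallest multiple of `multiple` that fits at least `coverage` of the examples."""
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    ordered = sorted(token_lengths)
    # Index of the shortest length that still covers the requested fraction.
    index = max(math.ceil(coverage * len(ordered)) - 1, 0)
    target = ordered[index]
    # Round up so the chosen value is a multiple of `multiple`.
    return math.ceil(target / multiple) * multiple


# Made-up token counts per training example.
lengths = [120, 180, 210, 250, 260, 300, 310, 330, 400, 900]

print(suggest_context_length(lengths, coverage=0.9))
print(suggest_context_length(lengths, coverage=1.0))
```

Examples longer than the chosen value are truncated, so a lower `coverage` trades some truncated examples for a smaller memory footprint.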

### 3. Run the job and monitor for performance bottlenecks

Here are some common performance bottlenecks:

#### Minimize GPU communication overhead
If you can secure a large enough instance, run the finetuning on a single node to avoid inter-node communication overhead during distributed training. You can request larger node instances by setting a custom compute configuration in the `job.yaml` file.

#### Maximize GPU memory utilization

The following parameters affect your GPU memory utilization:

1. The batch size per device
2. The chosen context length
3. The padding type

In addition, other settings, such as the DeepSpeed configuration, also affect memory usage.

You will want to tune these parameters to maximize your hardware utilization.
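
As a rough way to reason about the three parameters above, the sketch below computes the worst-case number of tokens each GPU processes per step. It is a back-of-the-envelope estimate with a hypothetical helper, not an LLMForge utility, and it assumes no gradient accumulation. The worst case corresponds to `padding: "max_length"`; with `padding: "longest"`, batches are padded only to their longest sequence, so actual counts are usually lower.

```python
def worst_case_tokens_per_step(batch_size_per_device, context_length, num_devices):
    """Upper bound on tokens per step: (per device, across all devices)."""
    per_device = batch_size_per_device * context_length
    return per_device, per_device * num_devices


# Values from configs/training/lora/mistral-7b.yaml shown earlier.
per_device, total = worst_case_tokens_per_step(
    batch_size_per_device=16, context_length=512, num_devices=2
)
print(f"worst case per device: {per_device} tokens")
print(f"worst case per step:   {total} tokens")

# Halving the context length halves the per-device bound,
# which leaves room to double the batch size for the same bound.
half_ctx, _ = worst_case_tokens_per_step(16, 256, 2)
doubled_batch, _ = worst_case_tokens_per_step(32, 256, 2)
print(half_ctx, doubled_batch)
```

Activation memory tends to grow with the per-device token count, so this bound is a convenient quantity to hold fixed while trading batch size against context length.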

<div class="alert alert-warning">

<b>Note:</b> For more advanced tuning, check out [this guide](https://docs.anyscale.com/canary/llms/finetuning/guides/optimize_cost/).

</div>

## Next Steps

We jumped directly into finetuning an LLM, but the next notebooks will cover the following topics:

1. How did we prepare the data for finetuning?
2. How should we evaluate the model?
3. How do we deploy the model?
