# Distributed Fine-Tuning of a Foundation Model for Multiple Tasks with Accelerate (QLoRA)
This notebook shows how to use the Hugging Face accelerate API with CML Workers to configure distributed training across multiple GPU-equipped CML Workers.

It runs the bundled QLoRA fine-tuning of an LLM on an instruction-following dataset, distributed across multiple CML Workers. The script produces the same instruction-following adapter as the one in the amp_adapters_prebuilt directory and the one built by the CML Job "Job for fine-tuning on Instruction Dataset".

Requirements:
- Notebook Session:
  - 2 CPU / 8 MEM / 1 GPU
- GPUs:
  - One GPU per accelerate worker must be available in this CML workspace (3 in total with NUM_WORKERS = 3 as set in Part 1):
    - 1 for this Notebook Session (described above)
    - 1 for each spawned CML Worker
- Runtime:
  - JupyterLab - Python 3.9 - Nvidia GPU - 2023.05

Note: This executes fine-tuning code defined in fine_tune_src/distributed_peft_scripts. See the implementation README in fine_tune_src/distributed_peft_scripts for a description of the fine-tuning code using huggingface transformers/trl.

### Set Training Script Path
This is the training script that will be distributed. It can run standalone or distributed with accelerate, because Hugging Face transformers and trl integrate with accelerate internally.

In [1]:
train_script = "fine_tune_src/distributed_peft_scripts/task_instruction_fine_tuner.py"

## Part 0: Install Dependencies

Install dependencies for all imports used in this notebook or referenced in the distributed fine-tuning script.

In [2]:
!pip install -q --no-cache-dir -r requirements.txt

## Part 1: Generate accelerate configuration
See https://huggingface.co/docs/accelerate/quicktour for guidance on setting up accelerate across workers manually, if desired.

The generated configurations depend on:
- NUM_WORKERS : (3 in this notebook) number of separate CML sessions/workers to run, including this session
- NUM_GPU_PER_WORKER : (1) number of GPUs per CML Worker
  - See gpu_ids in the accelerate configuration guide to adjust this in your accelerate config template
- MASTER_IP : the pod IP of this main CML session

These are the main accelerate settings that control how training is distributed.
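As a rough sketch of how these three values relate, the helper below derives the per-machine settings from them. The function name `accelerate_settings` and the address `10.0.0.1` are illustrative assumptions; the dictionary keys mirror the placeholder names used by this notebook's config template, not an accelerate API.

```python
def accelerate_settings(num_workers, num_gpu_per_worker, master_ip):
    """Return one settings dict per machine rank (illustrative helper)."""
    # accelerate counts processes across all machines: one process per GPU
    total_processes = num_workers * num_gpu_per_worker
    return [
        {
            "MACHINE_RANK": rank,          # 0 is the main session
            "MAIN_SESSION_IP": master_ip,  # every rank points at the main session
            "NUM_MACHINES": num_workers,
            "NUM_PROCESSES": total_processes,
        }
        for rank in range(num_workers)
    ]


# Placeholder IP address, used only for this example
settings = accelerate_settings(num_workers=3, num_gpu_per_worker=1, master_ip="10.0.0.1")
for s in settings:
    print(s)
```

Only the machine rank differs between workers; every other value is shared, which is why a single template can serve all of them.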

In [3]:
import os
NUM_WORKERS = 3
NUM_GPU_PER_WORKER = 1
MASTER_IP = os.environ["CDSW_IP_ADDRESS"]

# Set directory for all sub-workers to pull configurations from
conf_dir = "./.tmp_accelerate_configs_notebook/"
config_path_tmpl = conf_dir + "${WORKER}_config.yaml"

Each accelerate worker requires its own accelerate configuration; the next cell generates one file per worker.

In [4]:
import os
from string import Template

# Read the shared accelerate config template once
template_path = "fine_tune_src/distributed_peft_scripts/common/accelerate_configs/accelerate_multi_config.yaml.tmpl"
with open(template_path) as template_file:
    template_string = template_file.read()

os.makedirs(conf_dir, exist_ok=True)
for i in range(NUM_WORKERS):
    print("creating config %i" % i)
    # num_processes is the total across all machines (one process per GPU)
    config_text = Template(template_string).substitute(
        MACHINE_RANK=i,
        MAIN_SESSION_IP=MASTER_IP,
        NUM_MACHINES=NUM_WORKERS,
        NUM_PROCESSES=NUM_WORKERS * NUM_GPU_PER_WORKER,
    )
    config_path = Template(config_path_tmpl).substitute(WORKER=i)

    # Write one config file per worker rank
    with open(config_path, "w") as new_config:
        new_config.write(config_text)

creating config 0
creating config 1
creating config 2
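
To make the substitution step concrete, here is a minimal sketch of what rendering one config might look like. The template text below is hypothetical: the real accelerate_multi_config.yaml.tmpl may contain different or additional keys, and the port `29500` and address `10.0.0.1` are placeholder assumptions.

```python
from string import Template

# Hypothetical template text; the real .tmpl file in this repo may differ
example_template = """\
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
machine_rank: $MACHINE_RANK
main_process_ip: $MAIN_SESSION_IP
main_process_port: 29500
num_machines: $NUM_MACHINES
num_processes: $NUM_PROCESSES
"""

rendered = Template(example_template).substitute(
    MACHINE_RANK=1,
    MAIN_SESSION_IP="10.0.0.1",  # placeholder address
    NUM_MACHINES=3,
    NUM_PROCESSES=3,
)
print(rendered)
```

`Template.substitute` raises a `KeyError` when a placeholder has no value, so a mismatch between the template and the notebook variables surfaces here rather than at launch time.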


## Part 2: Execute accelerate CLI command on this session and spawned workers
**Note:** This session counts as worker 0

Using the predefined fine-tuning script, launch distributed fine-tuning by launching accelerate on CML Workers.

In [5]:
# Command template to launch across all sessions/workers
command_tmpl = "accelerate launch --config_file $CONF_PATH $TRAIN_SCRIPT"

To launch accelerate training in distributed mode, we execute accelerate launch as a shell command, using a separate config file for each "accelerate worker".

For example, if 2 "accelerate workers" are specified, one worker runs locally in this session and we launch one additional CML Worker.

If 3 "accelerate workers" are specified, one worker runs locally in this session and we launch two additional CML Workers, and so on.
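
The mapping from worker rank to launch location can be previewed without starting anything. The sketch below reuses the command and config path templates from this notebook; the helper name `build_launch_plan` and the location labels are illustrative assumptions.

```python
from string import Template

command_tmpl = "accelerate launch --config_file $CONF_PATH $TRAIN_SCRIPT"
config_path_tmpl = "./.tmp_accelerate_configs_notebook/${WORKER}_config.yaml"


def build_launch_plan(num_workers, train_script):
    """List (rank, location, command) tuples without launching anything."""
    plan = []
    for rank in range(num_workers):
        conf_path = Template(config_path_tmpl).substitute(WORKER=rank)
        command = Template(command_tmpl).substitute(
            CONF_PATH=conf_path, TRAIN_SCRIPT=train_script
        )
        # Rank 0 runs in the current session; every other rank gets its own CML Worker
        location = "this session" if rank == 0 else "new CML Worker"
        plan.append((rank, location, command))
    return plan


plan = build_launch_plan(
    3, "fine_tune_src/distributed_peft_scripts/task_instruction_fine_tuner.py"
)
for rank, location, command in plan:
    print(rank, location, command)
```

Printing the plan first is a cheap way to confirm that each rank points at its own config file before GPUs are allocated.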

In [6]:
try:
    # Launch workers when using CML
    from cml.workers_v1 import launch_workers
except ImportError:
    # Launch workers when using CDSW
    from cdsw import launch_workers
import subprocess


# Picking CPU and MEM profile
worker_cpu = 2
worker_memory = 8

# if changing worker_gpu here, also change gpu_ids in accelerate_multi_config.yaml.tmpl
worker_gpu = 1

for i in range(NUM_WORKERS):
    # Each accelerate launch requires different configuration
    config_path = Template(config_path_tmpl).substitute(WORKER=i)
    
    # See top of notebook for where train_script comes from
    command = Template(command_tmpl).substitute(CONF_PATH=config_path, TRAIN_SCRIPT=train_script)

    # Wrapping execution into subprocess for convenience in this notebook, but this could be done manually or via CML Jobs
    # Worker 0 is the main process and runs locally in this session
    if i == 0:
        print("Launch accelerate locally (this session acts as worker of rank 0 aka main worker)...")
        print("\t Command: [%s]" % command)
        main_cmd = subprocess.Popen([f'bash -c "{command}" '], shell=True)

    # All other accelerate launches will use rank 1+
    else:
        print("Launch CML worker and launch accelerate within them ...")
        print("\t Command: [%s]" % command)
        launch_workers(name=f'LoRA Train Worker {i}', n=1, cpu=worker_cpu, memory=worker_memory, nvidia_gpu=worker_gpu, code="!" + command + " &> /dev/null")

# Block until the local accelerate process (rank 0) exits
main_cmd.communicate()

Launch accelerate locally (this session acts as worker of rank 0 aka main worker)...
	 Command: [accelerate launch --config_file ./.tmp_accelerate_configs_notebook/0_config.yaml fine_tune_src/distributed_peft_scripts/task_instruction_fine_tuner.py]
Launch CML worker and launch accelerate within them ...
	 Command: [accelerate launch --config_file ./.tmp_accelerate_configs_notebook/1_config.yaml fine_tune_src/distributed_peft_scripts/task_instruction_fine_tuner.py]
Launch CML worker and launch accelerate within them ...
	 Command: [accelerate launch --config_file ./.tmp_accelerate_configs_notebook/2_config.yaml fine_tune_src/distributed_peft_scripts/task_instruction_fine_tuner.py]
bin /home/cdsw/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
Downloading and preparing dataset json/teknium--GPTeacher-General-Instruct to /home/cdsw/.cache/oinopwujnkn5g05p/huggingface/datasets/teknium___json/teknium--GPTeacher-General-Instruct-3d3eb51407944fd2/0.0.0/e347ab1c93209

Dataset json downloaded and prepared to /home/cdsw/.cache/oinopwujnkn5g05p/huggingface/datasets/teknium___json/teknium--GPTeacher-General-Instruct-3d3eb51407944fd2/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.


                                                                    

Load the base model and tokenizer...





trainable params: 385505280 || all params: 725575680 || trainable%: 53.1309538930522
Begin Training....


You're using a BloomTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'loss': 2.5563, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 2.4127, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 2.3082, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 2.3514, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 2.182, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 2.3055, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 2.3212, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 2.2462, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 2.2382, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 2.1217, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 2.1829, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 2.0968, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 2.0945, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 2.1266, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 2.0295, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 2.1909, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 2.1161, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 2.0723, 'learning_rate': 0.0002, '

(None, None)
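
The `(None, None)` above is the return value of `main_cmd.communicate()`: no `stdout` or `stderr` pipes were requested, so the child's output streams directly into the notebook and both tuple slots stay empty. A minimal sketch of the same behavior follows, using a trivial stand-in child process instead of `accelerate launch`; checking `returncode` is an addition the notebook itself does not make.

```python
import subprocess
import sys

# Stand-in child process; the notebook launches `accelerate launch ...` instead
proc = subprocess.Popen([sys.executable, "-c", "print('training...')"])

# Blocks until the child exits; returns (stdout, stderr), both None without pipes
result = proc.communicate()
print(result, proc.returncode)
```

A nonzero `returncode` would indicate that the local process failed, which the `(None, None)` tuple alone cannot reveal.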

## Done!
Your fine-tuned adapter is located in /home/cdsw/adapters/bloom1b1-lora-instruct

## Part 3: Inference Comparison (Base Model vs Base Model + Adapter)

### Load base model and tokenizer

In [7]:
from peft import PeftModel
from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b1", return_dict=True, device_map='cuda')
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-1b1")




Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
bin /home/cdsw/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/cdsw/.local/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...




### Load the fine-tuned adapter for use with the base model

In [8]:

model = PeftModel.from_pretrained(model=model,                                                 # The base model to load fine-tuned adapters with
                                  model_id="/home/cdsw/adapters/bloom1b1-lora-instruct",       # The directory path of the fine-tuned adapter built in Part 2
                                  adapter_name="bloom1b1-lora-instruct",              # A label for this adapter to enable and disable on demand later
)

### Define an instruction-following test prompt

In [9]:
prompt = """<Instruction>: Classify the following items into two categories: fruits and vegetables.
<Input>: tomato, apple, cucumber, carrot, banana, zucchini, strawberry, cauliflower
<Response>:"""
batch = tokenizer(prompt, return_tensors='pt')
batch = batch.to('cuda')
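
To try other instructions with the same layout as the test prompt above, a small formatting helper can be used. This is a sketch only; the function name `build_prompt` is hypothetical and not part of the project code.

```python
def build_prompt(instruction, input_text):
    """Format a prompt in the <Instruction>/<Input>/<Response> layout used above."""
    return f"<Instruction>: {instruction}\n<Input>: {input_text}\n<Response>:"


example = build_prompt(
    "Classify the following items into two categories: fruits and vegetables.",
    "tomato, apple, cucumber, carrot, banana, zucchini, strawberry, cauliflower",
)
print(example)
```

The prompt ends at `<Response>:` so that the model's generated tokens continue directly as the answer.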

#### Base Model Response

In [10]:
# Inference with base model only:
import torch
with model.disable_adapter():
    with torch.cuda.amp.autocast():
        output_tokens = model.generate(**batch, max_new_tokens=60)
    prompt_length = len(prompt)
    print(tokenizer.decode(output_tokens[0], skip_special_tokens=True)[prompt_length:])

 green, yellow, red, orange, red, yellow, green, blue, yellow, red, orange, red, yellow, green, blue, yellow, red, orange, red, yellow, green, blue, yellow, red, orange, red, yellow, green, blue, yellow,


^ The base model shows no ability to follow the instructions in the prompt.

#### Fine-tuned adapter Response

In [11]:
# Inference with fine-tuned adapter:
model.set_adapter("bloom1b1-lora-instruct")
with torch.cuda.amp.autocast():
    output_tokens = model.generate(**batch, max_new_tokens=60)
prompt_length = len(prompt)
print(tokenizer.decode(output_tokens[0], skip_special_tokens=True)[prompt_length:])

 Fruits: Tomato, Apple, Cucumber, Carrot, Banana, Zucchini, Strawberry, Cauliflower. Vegetables: Tomato, Apple, Cucumber, Carrot, Banana, Zucchini, Strawberry, Cauliflower


^ This is not a perfect response, but it is a good step toward a usable instruction-following LLM.