# Example: Distributed accelerate on CML
This example uses the Hugging Face accelerate API and CML Workers to show how to configure multiple GPU-equipped CML workers for distributed training.

Training Script: https://github.com/huggingface/accelerate/blob/main/examples/nlp_example.py
Script README: https://github.com/huggingface/accelerate/tree/main/examples#simple-nlp-example (Simple NLP example, multi GPUs section)

### Download the training example from the huggingface/accelerate repository

In [1]:
!curl https://raw.githubusercontent.com/huggingface/accelerate/main/examples/nlp_example.py -o nlp_example.py
train_script = "nlp_example.py"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  8260  100  8260    0     0  41717      0 --:--:-- --:--:-- --:--:-- 41717


### Install prerequisites with pip
(covers all imports used here and in the downloaded training script)

In [9]:
! pip install torch datasets evaluate transformers accelerate scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.3.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.9/10.9 MB 40.2 MB/s eta 0:00:00
Collecting joblib>=1.1.1 (from scikit-learn)
  Downloading joblib-1.3.2-py3-none-any.whl (302 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.2/302.2 kB 4.1 MB/s eta 0:00:00
Collecting threadpoolctl>=2.0.0 (from scikit-learn)
  Downloading threadpoolctl-3.2.0-py3-none-any.whl (15 kB)
Installing collected packages: threadpoolctl, joblib, scikit-learn
Successfully installed joblib-1.3.2 scikit-learn-1.3.0 threadpoolctl-3.2.0
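
Before launching anything, it can help to confirm that the packages are importable in this session. The following is a minimal sketch, not part of the original notebook; `missing_packages` is a hypothetical helper name, and the list of import names is an assumption derived from the pip command above (note that scikit-learn is imported as `sklearn`).

```python
import importlib.util


def missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


# Import names for the packages installed above; scikit-learn is imported as "sklearn".
REQUIRED = ["torch", "datasets", "evaluate", "transformers", "accelerate", "sklearn"]

missing = missing_packages(REQUIRED)
print("missing packages:", missing or "none")
```

This only checks the current session. CML workers launched later need the same packages available in their own environment.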


### Generate accelerate config files
See https://huggingface.co/docs/accelerate/quicktour for a guide to setting up accelerate across workers manually, if desired.

Generate one config per worker from these settings:
- NUM_WORKERS : number of separate CML sessions/workers to run
- NUM_GPU_PER_WORKER : number of GPUs per CML worker (1 here)
  - To change this, also adjust gpu_ids in your accelerate config template (see the accelerate configuration guide)
- MASTER_IP : the pod IP of this main CML session

In [4]:
import os
NUM_WORKERS = 2
NUM_GPU_PER_WORKER = 1
MASTER_IP = os.environ["CDSW_IP_ADDRESS"]


conf_dir = "./.tmp_accelerate_configs/"
config_path_tmpl = conf_dir + "${WORKER}_config.yaml"

In [5]:
import os
from string import Template

# Read the accelerate config template once
with open("accelerate_multi_config.yaml.tmpl") as template_file:
    template_string = template_file.read()

os.makedirs(conf_dir, exist_ok=True)
for i in range(NUM_WORKERS):
    print("creating config %i" % i)
    # Fill in the per-worker values; num_processes is the total across all machines
    config_text = Template(template_string).substitute(
        MACHINE_RANK=i,
        MAIN_SESSION_IP=MASTER_IP,
        NUM_MACHINES=NUM_WORKERS,
        NUM_PROCESSES=NUM_WORKERS * NUM_GPU_PER_WORKER,
    )
    config_path = Template(config_path_tmpl).substitute(WORKER=i)

    with open(config_path, "w") as new_config:
        new_config.write(config_text)

creating config 0
creating config 1
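
The template file accelerate_multi_config.yaml.tmpl itself is not reproduced in this notebook. As a rough sketch of what such a template might contain, the example below uses common multi-GPU accelerate config keys together with the four placeholder names from the cell above. The YAML values, the port 29500, the IP 10.0.0.1, and the `render_config` helper are all assumptions for illustration, not the actual file.

```python
from string import Template

# Hypothetical template text: the real accelerate_multi_config.yaml.tmpl is not shown here.
# Only the four $PLACEHOLDER names come from the notebook; the port is an arbitrary choice.
EXAMPLE_TEMPLATE = """\
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
gpu_ids: all
machine_rank: $MACHINE_RANK
main_process_ip: $MAIN_SESSION_IP
main_process_port: 29500
mixed_precision: 'no'
num_machines: $NUM_MACHINES
num_processes: $NUM_PROCESSES
rdzv_backend: static
same_network: true
use_cpu: false
"""


def render_config(rank, main_ip, num_machines, num_processes):
    """Substitute the per-worker values into the example template."""
    return Template(EXAMPLE_TEMPLATE).substitute(
        MACHINE_RANK=rank,
        MAIN_SESSION_IP=main_ip,
        NUM_MACHINES=num_machines,
        NUM_PROCESSES=num_processes,
    )


# Render the config for worker 1 of 2, using a placeholder IP address.
example = render_config(1, "10.0.0.1", 2, 2)
print(example)
```

Every worker gets the same main_process_ip and num_machines; only machine_rank differs between the generated files.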


### Execute the accelerate CLI command in this session and on spawned workers
**Note:** This session counts as worker 0

In [6]:
# Command template to launch across all sessions/workers
command_tmpl = "accelerate launch --config_file $CONF_PATH $TRAIN_SCRIPT"

To run accelerate training in distributed mode, accelerate launch must be executed as a shell command once per "accelerate worker", each with its own config file.

With 2 "accelerate workers", one runs locally in this session and one additional CML Worker is launched.

With 3 "accelerate workers", one runs locally in this session and two additional CML Workers are launched, and so on.
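
The mapping from worker rank to launch location can be sketched as a small planning function. This is an illustrative sketch only: `plan_launches` is a hypothetical helper, and the two template strings are copied from the cells in this notebook.

```python
from string import Template

# Same templates as used in the notebook cells.
COMMAND_TMPL = "accelerate launch --config_file $CONF_PATH $TRAIN_SCRIPT"
CONFIG_PATH_TMPL = "./.tmp_accelerate_configs/${WORKER}_config.yaml"


def plan_launches(num_workers, train_script="nlp_example.py"):
    """Return (rank, location, command) for each accelerate worker."""
    plan = []
    for rank in range(num_workers):
        conf_path = Template(CONFIG_PATH_TMPL).substitute(WORKER=rank)
        command = Template(COMMAND_TMPL).substitute(
            CONF_PATH=conf_path, TRAIN_SCRIPT=train_script
        )
        # Rank 0 runs in the current session; every other rank needs its own CML worker.
        location = "this session" if rank == 0 else "new CML worker"
        plan.append((rank, location, command))
    return plan


for rank, location, command in plan_launches(3):
    print(rank, location, command)
```

For N accelerate workers, the plan always contains one local launch and N - 1 CML worker launches.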

In [10]:
from cml.workers_v1 import launch_workers
import subprocess


# Picking CPU and MEM profile
worker_cpu = 2
worker_memory = 8

# if changing worker_gpu here, also change gpu_ids in accelerate_multi_config.yaml.tmpl
worker_gpu = 1

for i in range(NUM_WORKERS):
    # Each accelerate launch requires different configuration
    config_path = Template(config_path_tmpl).substitute(WORKER=i)
    
    # See top of notebook for where train_script comes from
    command = Template(command_tmpl).substitute(CONF_PATH=config_path, TRAIN_SCRIPT=train_script)

    # Wrapping execution in subprocess for convenience in this notebook, but this could be done manually or via CML Jobs
    # Worker 0 is the main process and runs locally in this session
    if i == 0:
        print("Launch accelerate locally (this session acts as worker of rank 0 aka main worker)...")
        print("\t Command: [%s]" % command)
        main_cmd = subprocess.Popen(["bash", "-c", command])

    # All other accelerate launches use rank 1+ and run in separate CML workers
    else:
        print("Launch CML worker and launch accelerate within them ...")
        print("\t Command: [%s]" % command)
        launch_workers(name=f'LoRA Train Worker {i}', n=1, cpu=worker_cpu, memory=worker_memory, nvidia_gpu=worker_gpu, code="!" + command + " &> /dev/null")

# Block until the local main accelerate process finishes
main_cmd.communicate()

Launch accelerate locally (this session acts as worker of rank 0 aka main worker)...
	 Command: [accelerate launch --config_file ./.tmp_accelerate_configs/0_config.yaml nlp_example.py]
Launch CML worker and launch accelerate within them ...
	 Command: [accelerate launch --config_file ./.tmp_accelerate_configs/1_config.yaml nlp_example.py]


Downloading (…)okenizer_config.json: 100%|██████████| 29.0/29.0 [00:00<00:00, 3.17kB/s]
Downloading (…)lve/main/config.json: 100%|██████████| 570/570 [00:00<00:00, 84.5kB/s]
Downloading (…)solve/main/vocab.txt: 100%|██████████| 213k/213k [00:00<00:00, 8.51MB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 436k/436k [00:00<00:00, 2.17MB/s]
Downloading builder script: 100%|██████████| 28.8k/28.8k [00:00<00:00, 21.9MB/s]
Downloading metadata: 100%|██████████| 28.7k/28.7k [00:00<00:00, 21.5MB/s]
Downloading readme: 100%|██████████| 27.9k/27.9k [00:00<00:00, 23.8MB/s]
Found cached dataset glue (/home/cdsw/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
100%|██████████| 3/3 [00:00<00:00, 72.75it/s]
Downloading model.safetensors: 100%|██████████| 436M/436M [00:01<00:00, 286MB/s]  
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['cla

epoch 0: {'accuracy': 0.7352941176470589, 'f1': 0.8373493975903614}
epoch 1: {'accuracy': 0.7965686274509803, 'f1': 0.8676236044657097}
epoch 2: {'accuracy': 0.8382352941176471, 'f1': 0.8846153846153846}


(None, None)
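
The (None, None) above is the return value of communicate(): when stdout and stderr are not piped, Popen.communicate() returns None for both. The sketch below illustrates this and shows one way to capture output and check the exit code. It uses a stand-in child process rather than the real accelerate launch command, and the error handling shown is an assumption, not part of the original notebook.

```python
import subprocess
import sys

# Stand-in child process; in the notebook this would be the accelerate launch command.
child_cmd = [sys.executable, "-c", "print('training finished')"]

# Without stdout/stderr pipes, communicate() returns (None, None), as in the cell output above.
proc = subprocess.Popen(child_cmd, stdout=subprocess.DEVNULL)
unpiped = proc.communicate()

# With pipes, the output is captured and the exit code can be checked.
proc = subprocess.Popen(
    child_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
)
stdout, stderr = proc.communicate()
if proc.returncode != 0:
    raise RuntimeError(f"process failed with exit code {proc.returncode}: {stderr}")
print(stdout.strip())
```

Checking returncode after communicate() makes a failed training run visible instead of letting the cell finish silently.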