### Resume training from a Hugging Face model
Using the Llama2-13B HF model as an example:
1. Convert the Llama2-13B HF model to a Megatron checkpoint with the target TP and PP sizes.
2. Resume training from the converted Megatron checkpoint.

In [1]:
import os
import sys
import torch
import argparse

MEGATRON_ROOT = "/cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master/"
sys.path.insert(0, MEGATRON_ROOT)

In [2]:
# Make the unicorn conversion tool importable, then import it
sys.path.append(os.path.join(MEGATRON_ROOT, "tools", "unicorn"))
import unicorn

  from .autonotebook import tqdm as notebook_tqdm


[2023-10-18 05:38:06,689] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)


#### Convert the Hugging Face model to a Megatron checkpoint
* You can also use the shell script `tools/unicorn/examples/llama/convert_hf_to_megatron.sh`.

In [3]:
def parse_args():
    parser = argparse.ArgumentParser()
    parser = unicorn.add_checkpointing_args(parser)
    parser = unicorn.add_transformers_checkpoint_args(parser)
    parser = unicorn.add_megatron_checkpoint_args(parser)
    args = parser.parse_args()
    return args

sys.argv = ['script.py',
            '--megatron-path', MEGATRON_ROOT,
            '--load-path', os.path.join(MEGATRON_ROOT, "models", "Llama-2-13b-hf"),
            '--save-path', os.path.join(MEGATRON_ROOT, "models", "llama-megatron"),
            '--model-name', 'llama2-13b',
            '--template-name', 'llama',
            '--print-checkpoint-structure',
            '--target_tensor_model_parallel_size', '2',
            '--target_pipeline_model_parallel_size', '2',
            '--target_params_dtype', 'fp16']

args = parse_args()

In [4]:
unicorn.convert_checkpoint_from_transformers_to_megatron(args)

=> Loading /cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master/models/Llama-2-13b-hf/model-00001-of-00003.safetensors ...
=> Loading /cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master/models/Llama-2-13b-hf/model-00003-of-00003.safetensors ...
=> Loading /cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master/models/Llama-2-13b-hf/model-00002-of-00003.safetensors ...


cp: cannot stat '/cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master/models/Llama-2-13b-hf/*.tiktoken': No such file or directory


=> Converting ...
=> converting embedding layer ...
=> Converting transformer blocks ...
Checkpoint structure of model state dict shard belonging to TP rank 0 and PP rank 0:
# model                                           
..# language_model                                
....# embedding                                   
......# word_embeddings                           
........# weight                                   : torch.Size([16000, 5120])
....# output_layer                                
......# weight                                     : torch.Size([16000, 5120])
....# encoder                                     
......# layers.0.input_norm.weight                 : torch.Size([5120])
......# layers.0.self_attention.query_key_value.weight : torch.Size([7680, 5120])
......# layers.0.self_attention.dense.weight       : torch.Size([5120, 2560])
......# layers.0.post_attention_norm.weight        : torch.Size([5120])
......# layers.0.mlp.dense_h_to_4h.weight          : torch

In [5]:
target_path = os.path.join(MEGATRON_ROOT, "models", "llama-megatron")
!ls {target_path}

config.json			   special_tokens_map.json
latest_checkpointed_iteration.txt  tokenizer.json
model.safetensors.index.json	   tokenizer_config.json
release


#### Launch a distributed task loading the Megatron checkpoint
* You can also refer to the following shell scripts:
    * `tools/unicorn/examples/llama/prepare_data.sh`
    * `tools/unicorn/examples/llama/run_examples.sh`
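
Before launching, it can help to work out what the launch arguments imply. The sketch below uses the values from the launch script in this notebook (8 GPUs, TP=2, PP=2, micro batch 1, global batch 128) to derive the data-parallel size and the number of micro-batches per step. The dataset target-size formulas are an assumption about how Megatron sizes its splits; the function names are illustrative only.

```python
def parallel_layout(world_size, tp, pp, micro_batch, global_batch):
    """Return (data_parallel_size, micro_batches_per_step)."""
    if world_size % (tp * pp) != 0:
        raise ValueError("world size must be divisible by TP * PP")
    dp = world_size // (tp * pp)
    if global_batch % (micro_batch * dp) != 0:
        raise ValueError("global batch must be divisible by micro_batch * DP")
    return dp, global_batch // (micro_batch * dp)


def dataset_target_sizes(train_iters, eval_interval, eval_iters, global_batch):
    """Minimum number of samples requested for train / validation / test.

    Assumed formulas: one evaluation run per eval interval plus a final one,
    and a single evaluation run for the test split.
    """
    train = train_iters * global_batch
    valid = (train_iters // eval_interval + 1) * eval_iters * global_batch
    test = eval_iters * global_batch
    return train, valid, test


if __name__ == "__main__":
    dp, n_micro = parallel_layout(world_size=8, tp=2, pp=2, micro_batch=1, global_batch=128)
    print("data-parallel size:", dp)
    print("micro-batches per step:", n_micro)
    print("target sizes:", dataset_target_sizes(50, 1000, 50, 128))
```

With these inputs the data-parallel size is 2 and each split needs 6400 samples, which agrees with the `data-parallel-size: 2` and `datasets target sizes` lines in the training log.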

In [6]:
# prepare data
shell = \
"""set -x

PYTHONPATH={0} python {0}/tools/preprocess_data.py \
  --input {0}/tests/unicorn/data/sample.jsonl \
  --json-keys text \
  --tokenizer-type PretrainedFromHF \
  --tokenizer-name-or-path {1} \
  --append-eod \
  --output-prefix {0}/tests/unicorn/data/sample_llama \
  --workers 4

""".format(MEGATRON_ROOT, os.path.join(MEGATRON_ROOT, "models", "Llama-2-13b-hf"))
# print(shell)
os.system(shell)

+ PYTHONPATH=/cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master/ python /cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master//tools/preprocess_data.py --input /cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master//tests/unicorn/data/sample.jsonl --json-keys text --tokenizer-type PretrainedFromHF --tokenizer-name-or-path /cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master/models/Llama-2-13b-hf --append-eod --output-prefix /cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master//tests/unicorn/data/sample_llama --workers 4
Zarr-based strategies will not be registered because of missing packages


Opening /cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master//tests/unicorn/data/sample.jsonl
Time to startup: 0.24666452407836914


0

In [7]:
# Launch distributed training, loading the Megatron checkpoint converted from the HF model.
# Sanity check: the LM loss at the first iteration should be around 1.5 if the weights were converted correctly.
launch_task = \
"""
MEGATRON_PATH={0}
CODE_ROOT={0}
export PYTHONPATH={0}:$PYTHONPATH
export CUDA_DEVICE_MAX_CONNECTIONS=1

NNODES=1
NODE_RANK=0
GPUS_PER_NODE=8
MASTER_ADDR=localhost
MASTER_PORT=12345

DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE \
                  --nnodes $NNODES \
                  --node_rank $NODE_RANK \
                  --master_addr $MASTER_ADDR \
                  --master_port $MASTER_PORT"

custom_options="--disable-bias-linear \
                --swiglu \
                --untie-embeddings-and-output-weights \
                --swiglu-make-ffn-hidden-size-divisible-by 256 \
                --position-embedding-type rope \
                --normalization RMSNorm \
                --norm-epsilon 1e-5 \
                --init-method-std 0.02 \
                --disable-scaled-init-method \
                "

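# The data is already tokenized with the Llama tokenizer (32000 tokens), so training uses NullTokenizer.
# NullTokenizer adds one EOD token on top of --vocab-size, hence 32000 - 1.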
VOCAB_SIZE=$(( 32000 - 1 ))
TP=2
PP=2

DATA_ROOT="{0}/tests/unicorn/data/"
TOKENIZER_NAME_OR_PATH="/path/to/tokenizer"
DATASET_PATH=" \
    $DATA_ROOT/sample_llama_text_document \
    "

OUTPUT_BASEPATH="{0}/test"
mkdir -p "$OUTPUT_BASEPATH/tensorboard/"
mkdir -p "$OUTPUT_BASEPATH/checkpoint/"
mkdir -p "$OUTPUT_BASEPATH/log/"
TENSORBOARD_DIR="$OUTPUT_BASEPATH/tensorboard/"
mkdir -p $TENSORBOARD_DIR

# NAME is not set in this script, so checkpoints are written directly under checkpoint/.
SAVED_PRETRAIN_CHECKPOINT_PATH="$OUTPUT_BASEPATH/checkpoint/$NAME"
LOAD_PATH={1}

megatron_options="  \
        --save $SAVED_PRETRAIN_CHECKPOINT_PATH \
        --load $LOAD_PATH \
        --split 99.5,0.5,0 \
        --data-path $DATASET_PATH \
        --lr 3e-4 \
        --min-lr 3e-5 \
        --lr-decay-style cosine \
        --adam-beta1 0.9 \
        --adam-beta2 0.95 \
        --adam-eps 1e-5 \
        --weight-decay 0.1 \
        --clip-grad 1.0 \
        --lr-decay-iters 50 \
        --lr-warmup-iters 10 \
        --train-iters 50 \
        --micro-batch-size 1 \
        --global-batch-size 128 \
        --num-layers 40 \
        --hidden-size 5120 \
        --num-attention-heads 40 \
        --seq-length 1024 \
        --max-position-embeddings 1024 \
        --attention-dropout 0.0 \
        --hidden-dropout 0.0 \
        --log-interval 1 \
        --eval-interval 1000 \
        --eval-iters 50 \
        --save-interval 1000 \
        --tensor-model-parallel-size $TP \
        --pipeline-model-parallel-size $PP \
        --num-workers 8 \
        --seed 888 \
        --tokenizer-type NullTokenizer \
        --vocab-size $VOCAB_SIZE \
        "

cd $MEGATRON_PATH

run_cmd="python -m torch.distributed.launch $DISTRIBUTED_ARGS $CODE_ROOT/pretrain_gpt.py \
         $megatron_options \
         $custom_options \
         --use-distributed-optimizer \
         --fp16 \
         --initial-loss-scale 65536 \
         --use-flash-attn \
         "

echo $run_cmd
eval $run_cmd

""".format(MEGATRON_ROOT, os.path.join(MEGATRON_ROOT, "models", "llama-megatron"))

# print(launch_task)
os.system(launch_task)

python -m torch.distributed.launch --nproc_per_node 8 --nnodes 1 --node_rank 0 --master_addr localhost --master_port 12345 /cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master//pretrain_gpt.py --save /cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master//test/checkpoint/ --load /cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master/models/llama-megatron --split 99.5,0.5,0 --data-path /cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master//tests/unicorn/data//sample_llama_text_document --lr 3e-4 --min-lr 3e-5 --lr-decay-style cosine --adam-beta1 0.9 --adam-beta2 0.95 --adam-eps 1e-5 --weight-decay 0.1 --clip-grad 1.0 --lr-decay-iters 50 --lr-warmup-iters 10 --train-iters 50 --micro-batch-size 1 --global-batch-size 128 --num-layers 40 --hidden-size 5120 --num-attention-heads 40 --seq-length 1024 --max-position-embeddings 1024 --attention-dropout 0.0 --hidden-dropout 0.0 --log-interval 1 --eval-interval 1000 --eval-iters 50 --save

and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Zarr-based strategies will not be registered because of missing packages

using world size: 8, data-parallel-size: 2, tensor-model-parallel size: 2, pipeline-model-parallel size: 2 
using torch.float16 for parameters ...
------------------------ arguments ------------------------
  accumulate_allreduce_grads_in_fp32 .............. False
  adam_beta1 ...................................... 0.9
  adam_beta2 ...................................... 0.95
  adam_eps ........................................ 1e-05
  add_bias_linear ................................. False
  adlr_autoresume ................................. False
  adlr_autoresume_interval ........................ 1000
  apply_layernorm_1p .............................. False
  apply_query_key_layer_scaling ................... True
  apply_residual_connection_post_layernorm ........ False
  async_tensor_model_parallel_allreduce ........... True
  attention_dropout ............................... 0.0
  attention_softmax_in_fp32 ....................... False
  barrier_with_L1_time ........................

  self._dummy_overflow_buf = torch.cuda.IntTensor([0])


 > number of parameters on (tensor, pipeline) model parallel rank (0, 1): 3254072320
> learning rate decay style: cosine
 loading release checkpoint from /cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master/models/llama-megatron
 checkpoint version 3.0
  successfully loaded checkpoint from /cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master/models/llama-megatron at iteration 0
(min, max) time across ranks (ms):
    load-checkpoint ................................: (3936.19, 3936.60)
[after model, optimizer, and learning rate scheduler are built] datetime: 2023-10-18 05:44:19 
> building train, validation, and test datasets ...
 > datasets target sizes (minimum size):
    train:      6400
    validation: 6400
    test:       6400
> building train, validation, and test datasets for GPT ...
Single data path provided for train, valid & test
 > building dataset index ...
    reading sequence lengths...
    reading sequence pointers...
    reading document in



 > loading doc-idx mapping from /cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master//tests/unicorn/data/index-cache/5bd83c009d154c0bac3bd39ccfa8247b_doc_idx.npy
 > loading sample-idx mapping from /cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master//tests/unicorn/data/index-cache/5bd83c009d154c0bac3bd39ccfa8247b_sample_idx.npy
 > loading shuffle-idx mapping from /cpfs/29ccba8f16c61395/data/user/liushan/projects/Megatron-LM-master//tests/unicorn/data/index-cache/5bd83c009d154c0bac3bd39ccfa8247b_shuffle_idx.npy
    loaded indexed file in 0.001 seconds
    total number of samples: 6408
    total number of epochs: 106
> finished creating GPT datasets ...
[after dataloaders are built] datetime: 2023-10-18 05:44:22 
done with setup ...
(min, max) time across ranks (ms):
    model-and-optimizer-setup ......................: (4497.73, 4512.70)
    train/valid/test-data-iterators-setup ..........: (2628.27, 2878.86)
training ...
[before the start of training ste



 iteration        1/      50 | consumed samples:          128 | elapsed time per iteration (ms): 8491.2 | learning rate: 3.000E-05 | global batch size:   128 | lm loss: 1.533932E+00 | loss scale: 65536.0 | grad norm: 1.595 | number of skipped iterations:   0 | number of nan iterations:   0 | TFLOPs: 152.43 |
[Rank 1] (after 1 iterations) memory (MB) | allocated: 37408.5986328125 | max allocated: 37408.59912109375 | reserved: 39424.0 | max reserved: 39424.0
[Rank 4] (after 1 iterations) memory (MB) | allocated: 37424.1484375 | max allocated: 37424.1796875 | reserved: 37694.0 | max reserved: 37694.0
[Rank 5] (after 1 iterations) memory (MB) | allocated: 37424.1484375 | max allocated: 37424.1796875 | reserved: 37694.0 | max reserved: 37694.0
[Rank 0] (after 1 iterations) memory (MB) | allocated: 37407.5986328125 | max allocated: 37407.59912109375 | reserved: 39708.0 | max reserved: 39708.0
 iteration        2/      50 | consumed samples:          256 | elapsed time per iteration (ms): 482

0
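
The first logged iteration carries enough information to judge whether the converted weights were loaded. The sketch below parses that log line and compares it with two reference values: the loss of a model that predicts uniformly over the 32000-token vocabulary, ln(32000) ≈ 10.37, and the learning rate expected after one step of warmup. The parser and the linear-warmup-from-zero formula are illustrative assumptions, not Megatron APIs.

```python
import math

# First iteration line copied from the training output in this notebook.
LOG_LINE = (
    " iteration        1/      50 | consumed samples:          128 | "
    "elapsed time per iteration (ms): 8491.2 | learning rate: 3.000E-05 | "
    "global batch size:   128 | lm loss: 1.533932E+00 | loss scale: 65536.0 | "
    "grad norm: 1.595 |"
)


def parse_metrics(line):
    """Split a 'key: value | key: value' log line into a dict of floats."""
    metrics = {}
    for part in line.split("|"):
        key, sep, value = part.partition(":")
        if not sep:
            continue  # e.g. the leading 'iteration 1/50' field has no colon
        try:
            metrics[key.strip()] = float(value)
        except ValueError:
            pass  # ignore non-numeric fields
    return metrics


def warmup_lr(step, max_lr=3e-4, warmup_iters=10):
    """Learning rate during warmup, assuming a linear ramp starting at zero."""
    return max_lr * min(step, warmup_iters) / warmup_iters


if __name__ == "__main__":
    m = parse_metrics(LOG_LINE)
    uniform_loss = math.log(32000)  # loss of a uniform guess over the vocabulary
    print(f"lm loss {m['lm loss']:.3f} vs uniform-guess loss {uniform_loss:.3f}")
    print("lr matches warmup:", math.isclose(m["learning rate"], warmup_lr(1)))
    tokens_per_step = 128 * 1024  # global batch size * sequence length
    seconds = m["elapsed time per iteration (ms)"] / 1000.0
    print(f"tokens/s in first iteration: {tokens_per_step / seconds:.0f}")
```

A first-step loss far below the uniform-guess value indicates that pretrained weights, not a random initialization, are in use; a loss near 10 would suggest that the checkpoint was not loaded or was converted incorrectly.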