
Add support to finetune with use_distributed_optimizer #68

Closed

Conversation

@dumpmemory (Contributor) commented Sep 18, 2023

Fix issues when finetuning with the --use-distributed-optimizer option.

@martinjaggi (Contributor)

Could you comment on what issue is solved by this fix (compared to the finetuning code and scripts we provide)?

@dumpmemory (Contributor, Author)

Could you comment on what issue is solved by this fix (compared to the finetuning code and scripts we provide)?

Yes. It fixes the functions that are missing when you add the --use-distributed-optimizer argument to the finetuning scripts.

@dumpmemory (Contributor, Author)

also fix #67 (comment)

@dumpmemory (Contributor, Author)

Any update?

@kylematoba (Collaborator)

Hi, sorry, no update: the whole team is working on a big run right now, and changing the function signature for checkpoint loading is not something we're keen to do at the moment. We should be done in about a month.

@mynewstart

Hi @dumpmemory, if I use --use_checkpoint_args and --use_distributed_optimizer together, an assertion error is raised in checkpointing.py because mpu is not initialized:

optim_name = os.path.join(
            common_path + "_%03d" % mpu.get_data_parallel_rank(),
            "optim.pt") 

The root cause is that _finish_mpu_init() is called after load_args_from_checkpoint(args) in initialize.py; the code is as follows:

def initialize_megatron(extra_args_provider=None,
                        args_defaults={}):
    """Set global variables, initialize distributed, and
    set autoresume and random seeds.
    `allow_no_cuda` should not be set unless using megatron for cpu only 
    data processing. In general this arg should not be set unless you know 
    what you are doing.
    """

    # Make sure cuda is available.
    assert torch.cuda.is_available(), 'Megatron requires CUDA.'

    # Parse arguments
    args = megatron.arguments.parse_args(extra_args_provider)

    if args.use_checkpoint_args or args_defaults.get('use_checkpoint_args', False):
        assert args.load is not None, '--use-checkpoints-args requires --load argument'
        load_args_from_checkpoint(args)

    megatron.arguments.validate_args(args, args_defaults)
        
    # set global args, build tokenizer, and set adlr_autoresume,
    # tensorboard-writer, and timers.
    set_global_variables(args)

    # torch.distributed initialization
    def _finish_mpu_init():
        _initialize_distributed(args)
        
        # Random seeds for reproducibility.
        if args.rank == 0:
            print('> setting random seeds to {} ...'.format(args.seed))
        _set_random_seed(args.seed, args.data_parallel_random_init)

    # Megatron's MPU is the master. Complete initialization right away.
    _finish_mpu_init()
    _init_autoresume()
    # _compile_dependencies(args)

    # No continuation function
    return None
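
For readers hitting the same crash: the failing line needs mpu.get_data_parallel_rank(), which is only available after _finish_mpu_init(), while load_args_from_checkpoint() runs first. Below is a minimal sketch of one possible workaround; the helper name, the rank-0 fallback, and the example paths are illustrative assumptions, not the actual change in this PR.

import os

def optimizer_checkpoint_path(common_path, data_parallel_rank=None):
    """Build the per-rank optimizer checkpoint path used with --use-distributed-optimizer.

    The optimizer state is sharded across data-parallel ranks, so the file
    lives in a directory suffixed with the data-parallel rank. If mpu is not
    initialized yet (as when load_args_from_checkpoint() runs before
    _finish_mpu_init()), fall back to rank 0 so an args-only load does not crash.
    """
    if data_parallel_rank is None:
        data_parallel_rank = 0  # assumption: mpu unavailable, args-only load
    return os.path.join(common_path + "_%03d" % data_parallel_rank, "optim.pt")

# Illustrative paths (the directory layout here is assumed, not taken from the repo):
print(optimizer_checkpoint_path("checkpoints/iter_0000100/mp_rank_00", 3))
# checkpoints/iter_0000100/mp_rank_00_003/optim.pt
print(optimizer_checkpoint_path("checkpoints/iter_0000100/mp_rank_00"))
# checkpoints/iter_0000100/mp_rank_00_000/optim.pt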

@dumpmemory (Contributor, Author)


I have updated the code; this is now fixed.

@dumpmemory (Contributor, Author)


Please try the updated version.

@kylematoba (Collaborator)

Hello @dumpmemory, we're working on clearing the open issues and will get to this one soon. Thank you for your patience.

@kylematoba (Collaborator)

Thank you for your contribution, @dumpmemory. We won't merge this, to keep our own complexity down. Sorry if this wasn't clear, but this repo is meant more as replication code for an upcoming paper than as a long-lived fork of NVIDIA's Megatron, and we are not keen to allocate time to features that we're not using. I'll add a note to the docs saying this :).

@kylematoba closed this on Nov 6, 2023
@mynewstart

@kylematoba So does that mean the main branch doesn't support use_distributed_optimizer?
