Crashing in SageMaker as a result of attempting to overwrite wandb config #5313

fdsig opened this issue Sep 6, 2023 · 0 comments

fdsig commented Sep 6, 2023

Overview:

The WandB SageMaker integration pre-populates the run config from environment variables, so fairseq's subsequent attempt to update the run config here always raises an error.

This is because the keys are already present in the WandB run config dictionary, and wandb refuses to overwrite them unless allow_val_change=True is passed to config.update().
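
A minimal sketch of the failure and the workaround, standalone rather than fairseq's exact code, with the behaviour shown by the wandb version in the traceback below:

    import wandb
    from wandb.sdk.lib.config_util import ConfigError

    # Once a key exists in run.config, updating it with a different value raises
    # ConfigError unless allow_val_change=True is passed. wandb's SageMaker
    # integration seeds keys such as "task" before fairseq gets to update them.
    run = wandb.init(mode="offline", project="repro")  # offline: no credentials needed
    run.config.update({"task": "translation"})  # seeded, as the SageMaker import does
    try:
        run.config.update({"task": {"_name": "translation"}})  # where fairseq crashes
    except ConfigError as err:
        print(err)

    # Passing allow_val_change=True lets the same update go through.
    run.config.update({"task": {"_name": "translation"}}, allow_val_change=True)
    run.finish()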

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

1. Run in SageMaker with params:

        "--no-progress-bar",
        "--azureml-logging",
        "--cpu",
        "--tpu",
        "--bf16",
        "--memory-efficient-bf16",
        "--fp16",
        "--memory-efficient-fp16",
        "--fp16-no-flatten-grads",
        "--on-cpu-convert-precision",
        "--amp",
        "--profile",
        "--reset-logging",
        "--suppress-crashes",
        "--use-plasma-view",
        "--combine-valid-subsets",
        "--ignore-unused-valid-subsets",
        "--disable-validation",
        "--grouped-shuffling",
        "--update-epoch-batch-itr",
        "--update-ordered-indices-seed",
        "--distributed-no-spawn",
        "--find-unused-parameters",
        "--gradient-as-bucket-view",
        "--fast-stat-sync",
        "--broadcast-buffers",
        "--pipeline-model-parallel",
        "--not-fsdp-flatten-parameters",
        "--sentence-avg",
        "--continue-once",
        "--reset-dataloader",
        "--reset-lr-scheduler",
        "--reset-meters",
        "--reset-optimizer",
        "--no-save",
        "--no-epoch-checkpoints",
        "--no-last-checkpoints",
        "--no-save-optimizer-state",
        "--maximize-best-checkpoint-metric",
        "--load-checkpoint-on-all-dp-ranks",
        "--write-checkpoints-asynchronously",
        "--store-ema",
        "--ema-fp32",
        "--adaptive-input",
        "--encoder-normalize-before",
        "--encoder-learned-pos",
        "--decoder-normalize-before",
        "--decoder-learned-pos",
        "--share-decoder-input-output-embed",
        "--share-all-embeddings",
        "--merge-src-tgt-embed",
        "--no-token-positional-embeddings",
        "--layernorm-embedding",
        "--tie-adaptive-weights",
        "--tie-adaptive-proj",
        "--no-scale-embedding",
        "--checkpoint-activations",
        "--offload-activations",
        "--no-cross-attention",
        "--cross-self-attention",
        "--char-inputs",
        "--base-shuffle",
        "--export",
        "--no-decoder-final-norm",
        "--load-alignments",
        "--left-pad-source",
        "--left-pad-target",
        "--truncate-source",
        "--eval-bleu",
        "--eval-tokenized-bleu",
        "--eval-bleu-print-samples",
        "--report-accuracy",
        "--use-old-adam",
        "--fp16-adam-stats",
  2. See error:
wandb.sdk.lib.config_util.ConfigError: Attempted to change value of key "task" from "translation" to {'_name': 'translation', 'data': '/opt/ml/input/data/shards/enc/mmap_base.bin', 'source_lang': 'en', 'target_lang': 'it', 'load_alignments': False, 'left_pad_source': True, 'left_pad_target': False, 'max_source_positions': 1024, 'max_target_positions': 1024, 'upsample_primary': -1, 'truncate_source': False, 'num_batch_buckets': 0, 'train_subset': 'train', 'dataset_impl': 'mmap', 'required_seq_len_multiple': 1, 'eval_bleu': True, 'eval_bleu_args': '{"beam": 4, "max_len_a": 1.2, "max_len_b": 100}', 'eval_bleu_detok': 'space', 'eval_bleu_detok_args': '{}', 'eval_tokenized_bleu': False, 'eval_bleu_remove_bpe': 'sentencepiece', 'eval_bleu_print_samples': True}
If you really want to do this, pass allow_val_change=True to config.update()

Code sample

    # role, wandb_api_key, args, some_int and full_job_name are defined elsewhere.
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        dependencies=["./train_lib/requirements.txt"],
        entry_point="train.py",
        framework_version="2.0",
        py_version="py310",
        role=role,
        instance_count=1,
        instance_type="ml.p3.xlarge",
        volume_size=some_int,  # EBS volume size in GB
        base_job_name=args.job_name,
        input_mode="FastFile",
        output_path="s3://some.bucket.uri",
        checkpoint_s3_uri="s3://some.bucket.uri",
        environment={
            "WANDB_API_KEY": wandb_api_key,
            "WANDB_BASE_URL": "https://api.wandb.ai",
        },
    )

    estimator.fit(
        {"shards": args.shards_path},
        job_name=full_job_name,
    )
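
For background (not part of the original report): the {"shards": ...} channel passed to estimator.fit is mounted at /opt/ml/input/data/shards inside the training container, which is why that path appears in the error above, and the environment dict is exported as environment variables there, which is how wandb authenticates and detects the SageMaker run. A minimal sketch of what train.py sees under those assumptions:

    import os

    # Inside the container launched by estimator.fit():
    #   - each fit() channel is mounted at /opt/ml/input/data/<channel_name>
    #   - the estimator's `environment` dict becomes process environment variables
    shards_dir = "/opt/ml/input/data/shards"        # the "shards" channel from fit()
    has_wandb_key = "WANDB_API_KEY" in os.environ   # set via environment= above
    print(f"data at {shards_dir}; wandb credentials present: {has_wandb_key}")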

Expected behavior

The run should log the WandB config as normal when launched from a SageMaker training job.

Environment

  • fairseq Version: latest
  • PyTorch Version (e.g., 1.0): not relevant to this issue.
  • OS (e.g., Linux): SM EC2 Training Job with instance_type "ml.p3.2xlarge" Linux-5.10.186-179.751.amzn2.x86_64-x86_64-with-glibc2.31
  • How you installed fairseq (pip, source): requirements.txt in SM instance
  • Build command you used (if compiling from source): NA
  • Python version: 3.10
  • GPU models and configuration: 4 x Tesla V100-SXM2-16GB
  • Any other relevant information:

Additional context

I have fixed and tested this in a fork, as per your contributing guidance, here.
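
The change amounts to allowing value changes where fairseq updates the wandb run config. A hedged sketch of its shape only, since this is not the actual fork diff and the surrounding method and attribute names in fairseq's wandb progress-bar wrapper are approximate:

    # Sketch of the one-line change; names approximate, not the actual fork diff.
    def update_config(self, config):
        """Log the latest configuration to wandb."""
        if wandb is not None:
            # allow_val_change=True lets the update overwrite keys that wandb's
            # SageMaker integration already seeded from environment variables.
            wandb.config.update(config, allow_val_change=True)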
