Crashing in SageMaker as a result of attempting to overwrite wandb config #5313

fdsig opened this issue Sep 6, 2023 · 0 comments

fdsig commented Sep 6, 2023

Overview:

The WandB SageMaker integration pre-populates the run config from environment variables, so fairseq's subsequent attempt to update the run config here always raises an error.

This is because the keys are already present in the WandB run config dictionary, and wandb refuses to overwrite them unless allow_val_change=True is passed to config.update().
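
A minimal sketch of the failure and the workaround, standalone rather than fairseq's exact code, with the behaviour shown by the wandb version in the traceback below:

    import wandb
    from wandb.sdk.lib.config_util import ConfigError

    # Once a key exists in run.config, updating it with a different value raises
    # ConfigError unless allow_val_change=True is passed. wandb's SageMaker
    # integration seeds keys such as "task" before fairseq gets to update them.
    run = wandb.init(mode="offline", project="repro")  # offline: no credentials needed
    run.config.update({"task": "translation"})  # seeded, as the SageMaker import does
    try:
        run.config.update({"task": {"_name": "translation"}})  # where fairseq crashes
    except ConfigError as err:
        print(err)

    # Passing allow_val_change=True lets the same update go through.
    run.config.update({"task": {"_name": "translation"}}, allow_val_change=True)
    run.finish()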

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

1. Run in SageMaker with params:

        "--no-progress-bar",
        "--azureml-logging",
        "--cpu",
        "--tpu",
        "--bf16",
        "--memory-efficient-bf16",
        "--fp16",
        "--memory-efficient-fp16",
        "--fp16-no-flatten-grads",
        "--on-cpu-convert-precision",
        "--amp",
        "--profile",
        "--reset-logging",
        "--suppress-crashes",
        "--use-plasma-view",
        "--combine-valid-subsets",
        "--ignore-unused-valid-subsets",
        "--disable-validation",
        "--grouped-shuffling",
        "--update-epoch-batch-itr",
        "--update-ordered-indices-seed",
        "--distributed-no-spawn",
        "--find-unused-parameters",
        "--gradient-as-bucket-view",
        "--fast-stat-sync",
        "--broadcast-buffers",
        "--pipeline-model-parallel",
        "--not-fsdp-flatten-parameters",
        "--sentence-avg",
        "--continue-once",
        "--reset-dataloader",
        "--reset-lr-scheduler",
        "--reset-meters",
        "--reset-optimizer",
        "--no-save",
        "--no-epoch-checkpoints",
        "--no-last-checkpoints",
        "--no-save-optimizer-state",
        "--maximize-best-checkpoint-metric",
        "--load-checkpoint-on-all-dp-ranks",
        "--write-checkpoints-asynchronously",
        "--store-ema",
        "--ema-fp32",
        "--adaptive-input",
        "--encoder-normalize-before",
        "--encoder-learned-pos",
        "--decoder-normalize-before",
        "--decoder-learned-pos",
        "--share-decoder-input-output-embed",
        "--share-all-embeddings",
        "--merge-src-tgt-embed",
        "--no-token-positional-embeddings",
        "--layernorm-embedding",
        "--tie-adaptive-weights",
        "--tie-adaptive-proj",
        "--no-scale-embedding",
        "--checkpoint-activations",
        "--offload-activations",
        "--no-cross-attention",
        "--cross-self-attention",
        "--char-inputs",
        "--base-shuffle",
        "--export",
        "--no-decoder-final-norm",
        "--load-alignments",
        "--left-pad-source",
        "--left-pad-target",
        "--truncate-source",
        "--eval-bleu",
        "--eval-tokenized-bleu",
        "--eval-bleu-print-samples",
        "--report-accuracy",
        "--use-old-adam",
        "--fp16-adam-stats",
  2. See error:
wandb.sdk.lib.config_util.ConfigError: Attempted to change value of key "task" from "translation" to {'_name': 'translation', 'data': '/opt/ml/input/data/shards/enc/mmap_base.bin', 'source_lang': 'en', 'target_lang': 'it', 'load_alignments': False, 'left_pad_source': True, 'left_pad_target': False, 'max_source_positions': 1024, 'max_target_positions': 1024, 'upsample_primary': -1, 'truncate_source': False, 'num_batch_buckets': 0, 'train_subset': 'train', 'dataset_impl': 'mmap', 'required_seq_len_multiple': 1, 'eval_bleu': True, 'eval_bleu_args': '{"beam": 4, "max_len_a": 1.2, "max_len_b": 100}', 'eval_bleu_detok': 'space', 'eval_bleu_detok_args': '{}', 'eval_tokenized_bleu': False, 'eval_bleu_remove_bpe': 'sentencepiece', 'eval_bleu_print_samples': True}
If you really want to do this, pass allow_val_change=True to config.update()

Code sample

    # role, wandb_api_key, args, some_int and full_job_name are defined elsewhere.
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        dependencies=["./train_lib/requirements.txt"],
        entry_point="train.py",
        framework_version="2.0",
        py_version="py310",
        role=role,
        instance_count=1,
        instance_type="ml.p3.xlarge",
        volume_size=some_int,  # EBS volume size in GB
        base_job_name=args.job_name,
        input_mode="FastFile",
        output_path="s3://some.bucket.uri",
        checkpoint_s3_uri="s3://some.bucket.uri",
        environment={
            "WANDB_API_KEY": wandb_api_key,
            "WANDB_BASE_URL": "https://api.wandb.ai",
        },
    )

    estimator.fit(
        {"shards": args.shards_path},
        job_name=full_job_name,
    )
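
For background (not part of the original report): the {"shards": ...} channel passed to estimator.fit is mounted at /opt/ml/input/data/shards inside the training container, which is why that path appears in the error above, and the environment dict is exported as environment variables there, which is how wandb authenticates and detects the SageMaker run. A minimal sketch of what train.py sees under those assumptions:

    import os

    # Inside the container launched by estimator.fit():
    #   - each fit() channel is mounted at /opt/ml/input/data/<channel_name>
    #   - the estimator's `environment` dict becomes process environment variables
    shards_dir = "/opt/ml/input/data/shards"        # the "shards" channel from fit()
    has_wandb_key = "WANDB_API_KEY" in os.environ   # set via environment= above
    print(f"data at {shards_dir}; wandb credentials present: {has_wandb_key}")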

Expected behavior

The run should log the WandB config as normal when launched from a SageMaker training job.

Environment

  • fairseq Version: latest
  • PyTorch Version (e.g., 1.0): not relevant to this issue.
  • OS (e.g., Linux): SM EC2 Training Job with instance_type "ml.p3.2xlarge" Linux-5.10.186-179.751.amzn2.x86_64-x86_64-with-glibc2.31
  • How you installed fairseq (pip, source): requirements.txt in SM instance
  • Build command you used (if compiling from source): NA
  • Python version: 3.10
  • GPU models and configuration: 4 x Tesla V100-SXM2-16GB
  • Any other relevant information:

Additional context

I have fixed and tested this in a fork, as per your contributing guidance, here.
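
The change amounts to allowing value changes where fairseq updates the wandb run config. A hedged sketch of its shape only, since this is not the actual fork diff and the surrounding method and attribute names in fairseq's wandb progress-bar wrapper are approximate:

    # Sketch of the one-line change; names approximate, not the actual fork diff.
    def update_config(self, config):
        """Log the latest configuration to wandb."""
        if wandb is not None:
            # allow_val_change=True lets the update overwrite keys that wandb's
            # SageMaker integration already seeded from environment variables.
            wandb.config.update(config, allow_val_change=True)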
