ValueError: duplicate file name in tar file when training #2

Open
HollowMan6 opened this issue Mar 30, 2024 · 0 comments

I'm using CSC's HPC cluster Puhti and trying to run multi-node distributed training (2 nodes, each with 4 V100s). Everything generally looks fine, but I'm now stuck on a dataset issue, as the issue title suggests. I suspect you may have run into something similar. Please check the attached stack trace and let me know your insights, thank you!

How I set up the environment

I generally followed the steps suggested in the README.

I installed the Python dependencies from the environment.yml file you provided, through CSC's container wrapper Tykky. I downloaded the dataset and sharded it via python data/dataset_prep.py -sc 7 (shards 0..7, since the world size is 2*4=8).
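
For reference, conceptually the sharding boils down to something like the sketch below. This is not the actual data/dataset_prep.py; the directory layout and file names are illustrative guesses, based on the 24-view-plus-cameras sample structure visible in the error further down:

```python
from pathlib import Path

import webdataset as wds

# Hypothetical layout: one directory per object, holding 24 rendered
# views (0000.png .. 0023.png) plus a cameras file.
objects = sorted(Path("NMR").iterdir())

with wds.ShardWriter("NMR-%d.tar", maxcount=1000) as sink:
    for obj in objects:
        # __key__ must be unique across ALL shards, and should not contain
        # dots, because webdataset splits member names at the first dot.
        sample = {"__key__": obj.name}
        for i in range(24):
            sample[f"{i:04d}.png"] = (obj / f"{i:04d}.png").read_bytes()
        # "cameras.npy" is a guess at the on-disk name; the tar member
        # ends up as "<key>.cameras" either way.
        sample["cameras"] = (obj / "cameras.npy").read_bytes()
        sink.write(sample)
```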

Here is the config file I'm using: https://github.com/HollowMan6/view-fusion/blob/main/configs/small-v100-4-2.yaml

And the job script: https://github.com/HollowMan6/view-fusion/blob/main/slurm/train_ddp_v100_d.slrm

Investigation

I found a related issue in webdataset, webdataset/webdataset#157 (comment). Following the suggestions there, I ran the command below and indeed found many duplicated entries:

ls NMR-*.tar | xargs -n1 tar tf | sort | uniq -d
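
The same check in Python, if that's more convenient (a small stdlib sketch; like the shell pipeline, it counts member names across all shards combined):

```python
import glob
import tarfile
from collections import Counter

# Count every member name across all shards; any name seen more than
# once will collide when webdataset groups members into samples.
counts = Counter()
for shard in glob.glob("NMR-*.tar"):
    with tarfile.open(shard) as tf:
        counts.update(tf.getnames())

for name, n in sorted(counts.items()):
    if n > 1:
        print(name, n)
```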

Then I tried adding more content to the sample key to avoid collisions (HollowMan6@26143ec). With that change the above command outputs nothing, but I still get the error (full trace below).
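
For context on why any rename has to change the part before the first dot of the member name: webdataset derives each sample's grouping key roughly like this (a sketch modeled on base_plus_ext in webdataset/tariterators.py; the exact regex may differ across versions). Two members with the same (key, extension) pair are what trigger "duplicate file name in tar file":

```python
import re

# Roughly how webdataset splits a tar member name into (key, extension):
# the key is everything up to the FIRST dot in the basename, and the
# extension is everything after it.
def base_plus_ext(path):
    match = re.match(r"^((?:.*/|)[^.]+)[.]([^/]*)$", path)
    if not match:
        return None, None
    return match.group(1), match.group(2)

print(base_plus_ext("b1a6f690b8ae5ee4471299059767b3d6.0000.png"))
# ('b1a6f690b8ae5ee4471299059767b3d6', '0000.png')
```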

[default1]:Traceback (most recent call last):
[default1]:  File "/scratch/project_*/*/view-fusion/slurm/../main.py", line 38, in <module>
[default1]:    main(args)
[default1]:  File "/scratch/project_*/*/view-fusion/slurm/../main.py", line 25, in main
[default1]:    experiment.train()
[default1]:  File "/scratch/project_*/*/view-fusion/experiment.py", line 212, in train
[default1]:    for batch in self.train_loader:
[default1]:  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/webdataset/pipeline.py", line 70, in iterator
[default1]:    yield from self.iterator1()
[default1]:  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
[default1]:    data = self._next_data()
[default1]:           ^^^^^^^^^^^^^^^^^
[default1]:  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
[default1]:    return self._process_data(data)
[default1]:           ^^^^^^^^^^^^^^^^^^^^^^^^
[default1]:  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
[default1]:    data.reraise()
[default1]:  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/_utils.py", line 722, in reraise
[default1]:    raise exception
[default1]:ValueError: Caught ValueError in DataLoader worker process 0.
[default1]:Original Traceback (most recent call last):
[default1]:  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
[default1]:    data = fetcher.fetch(index)
[default1]:           ^^^^^^^^^^^^^^^^^^^^
[default1]:  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
[default1]:    data.append(next(self.dataset_iter))
[default1]:                ^^^^^^^^^^^^^^^^^^^^^^^
[default1]:  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/webdataset/pipeline.py", line 70, in iterator
[default1]:    yield from self.iterator1()
[default1]:  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/webdataset/filters.py", line 302, in _map
[default1]:    for sample in data:
[default1]:  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/webdataset/filters.py", line 302, in _map
[default1]:    for sample in data:
[default1]:  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/webdataset/filters.py", line 214, in _shuffle
[default1]:    for sample in data:
[default1]:  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/webdataset/tariterators.py", line 246, in group_by_keys
[default1]:    if handler(exn):
[default1]:       ^^^^^^^^^^^^
[default1]:  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/webdataset/filters.py", line 86, in reraise_exception
[default1]:    raise exn
[default1]:  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/webdataset/tariterators.py", line 239, in group_by_keys
[default1]:    raise ValueError(
[default1]:ValueError: ("b1a6f690b8ae5ee4471299059767b3d6.0000.png: duplicate file name in tar file 0000.png dict_keys(['__key__', '__url__', '0000.png', '0001.png', '0002.png', '0003.png', '0004.png', '0005.png', '0006.png', '0007.png', '0008.png', '0009.png', '0010.png', '0011.png', '0012.png', '0013.png', '0014.png', '0015.png', '0016.png', '0017.png', '0018.png', '0019.png', '0020.png', '0021.png', '0022.png', '0023.png', 'cameras'])", None, None)
[default1]:
[2024-03-30 12:25:27,830] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3457767 closing signal SIGTERM
[2024-03-30 12:25:27,830] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3457769 closing signal SIGTERM
[2024-03-30 12:25:27,830] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3457770 closing signal SIGTERM
[2024-03-30 12:25:28,530] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 3457768) of binary: /PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/bin/python
Traceback (most recent call last):
  File "/scratch/project_*/*/view-fusion/venv/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.0', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-30_12:25:27
  host      : r02g01.bullx
  rank      : 5 (local_rank: 1)
  exitcode  : 1 (pid: 3457768)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: r02g01: task 1: Exited with exit code 1
srun: Terminating StepId=21035875.0
slurmstepd: error: *** STEP 21035875.0 ON r01g01 CANCELLED AT 2024-03-30T12:25:28 ***
srun: error: r01g01: task 0: Terminated
srun: Force Terminated StepId=21035875.0