I'm using CSC's HPC system Puhti and trying to run multi-node distributed training (2 nodes, each with 4 V100s). Everything generally looks fine, but I'm now stuck on the dataset issue described in this PR's title. I suspect you would run into similar issues as well. Please check the attached stack trace and let me know your insights, thank you!
How I set up the environment
I generally followed the steps suggested in the README.
I installed the Python dependencies from the environment.yml file you provided, through CSC's container wrapper tykky. I downloaded the dataset and did the sharding via python data/dataset_prep.py -sc 7 (shards 0..7, since the world size is 2*4=8).
Here is the config file I'm using: https://github.com/HollowMan6/view-fusion/blob/main/configs/small-v100-4-2.yaml
The job script: https://github.com/HollowMan6/view-fusion/blob/main/slurm/train_ddp_v100_d.slrm
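For clarity on why I chose 8 shards: with one shard per rank, every process reads a disjoint subset of the data. A minimal sketch of the round-robin rank-to-shard assignment I'm assuming (in the spirit of webdataset's split_by_node; the NMR-* shard naming is an assumption based on the shard listing command further below):

```python
# Sketch: how 8 shards could map onto 2 nodes x 4 GPUs (world size 8),
# assuming a round-robin split like webdataset's split_by_node.
# The NMR-* naming is an assumption, not necessarily what dataset_prep.py emits.
shards = [f"NMR-{i:06d}.tar" for i in range(8)]

def shards_for_rank(shards, rank, world_size):
    """Round-robin: rank r reads shards r, r + world_size, r + 2*world_size, ..."""
    return shards[rank::world_size]

world_size = 2 * 4
assignment = {r: shards_for_rank(shards, r, world_size) for r in range(world_size)}
assert all(len(s) == 1 for s in assignment.values())  # exactly one shard per rank
```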
Investigation
I found a related issue at webdataset/webdataset#157 (comment). Following the suggestion there, I executed the following command and did find many duplicated entries:
ls NMR-*.tar | xargs -n1 tar tf | sort | uniq -d
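For context on why duplicates break loading: webdataset groups consecutive tar members into samples by their key prefix, and a repeated suffix within one sample triggers exactly the ValueError in the trace. A simplified, stdlib-only sketch of that grouping logic (base_plus_ext and group_by_keys here are re-implementations for illustration, not the library's exact code):

```python
import re

def base_plus_ext(path):
    # Split "<key>.<suffix>" at the first dot: "abc.0000.png" -> ("abc", "0000.png").
    # Simplified re-implementation of webdataset's splitter, for illustration only.
    m = re.match(r"^((?:.*/|)[^.]+)\.([^/]*)$", path)
    if m is None:
        return None, None
    return m.group(1), m.group(2)

def group_by_keys(names):
    """Group consecutive tar member names into samples keyed by their prefix,
    raising ValueError on a repeated suffix within one sample -- the same
    condition reported in the stack trace below."""
    current = {}
    for name in names:
        key, suffix = base_plus_ext(name)
        if current and current["__key__"] != key:
            yield current          # key changed: previous sample is complete
            current = {}
        if suffix in current:
            raise ValueError(f"{name}: duplicate file name in tar file {suffix}")
        current["__key__"] = key
        current[suffix] = name
    if current:
        yield current

# Two distinct samples group cleanly...
assert len(list(group_by_keys(["abc.0000.png", "abc.0001.png", "xyz.0000.png"]))) == 2
# ...but a repeated member name inside one sample raises, as in the trace:
try:
    list(group_by_keys(["abc.0000.png", "abc.0000.png"]))
except ValueError as e:
    print(e)  # abc.0000.png: duplicate file name in tar file 0000.png
```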
Then I tried adding more content to the sample "key" to avoid collisions (HollowMan6@26143ec). After that change the command above outputs nothing, but I still get the error.
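In case it helps to show the kind of key change I mean: prefixing each sample key with a global counter guarantees uniqueness even if two objects share the same hash. This is a stdlib-only sketch; write_sample and the sample layout are hypothetical stand-ins for whatever data/dataset_prep.py actually does:

```python
import io
import tarfile

def write_sample(tar, key, files):
    """Write one sample: each (suffix -> bytes) entry becomes a '<key>.<suffix>' member."""
    for suffix, data in files.items():
        info = tarfile.TarInfo(name=f"{key}.{suffix}")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Two hypothetical objects that happen to share the same hash; without a
# counter prefix their member names would collide in the tar.
samples = [
    ("b1a6f690b8ae5ee4471299059767b3d6", {"0000.png": b"a"}),
    ("b1a6f690b8ae5ee4471299059767b3d6", {"0000.png": b"b"}),
]

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for i, (obj_hash, files) in enumerate(samples):
        unique_key = f"{i:08d}_{obj_hash}"  # counter prefix; no dots inside the key
        write_sample(tar, unique_key, files)

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()
assert len(names) == len(set(names))  # no duplicate member names
```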
[default1]:Traceback (most recent call last):
[default1]: File "/scratch/project_*/*/view-fusion/slurm/../main.py", line 38, in <module>
[default1]: main(args)
[default1]: File "/scratch/project_*/*/view-fusion/slurm/../main.py", line 25, in main
[default1]: experiment.train()
[default1]: File "/scratch/project_*/*/view-fusion/experiment.py", line 212, in train
[default1]: for batch in self.train_loader:
[default1]: File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/webdataset/pipeline.py", line 70, in iterator
[default1]: yield from self.iterator1()
[default1]: File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
[default1]: data = self._next_data()
[default1]: ^^^^^^^^^^^^^^^^^
[default1]: File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1346, in _next_data
[default1]: return self._process_data(data)
[default1]: ^^^^^^^^^^^^^^^^^^^^^^^^
[default1]: File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1372, in _process_data
[default1]: data.reraise()
[default1]: File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/_utils.py", line 722, in reraise
[default1]: raise exception
[default1]:ValueError: Caught ValueError in DataLoader worker process 0.
[default1]:Original Traceback (most recent call last):
[default1]: File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
[default1]: data = fetcher.fetch(index)
[default1]: ^^^^^^^^^^^^^^^^^^^^
[default1]: File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
[default1]: data.append(next(self.dataset_iter))
[default1]: ^^^^^^^^^^^^^^^^^^^^^^^
[default1]: File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/webdataset/pipeline.py", line 70, in iterator
[default1]: yield from self.iterator1()
[default1]: File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/webdataset/filters.py", line 302, in _map
[default1]: for sample in data:
[default1]: File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/webdataset/filters.py", line 302, in _map
[default1]: for sample in data:
[default1]: File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/webdataset/filters.py", line 214, in _shuffle
[default1]: for sample in data:
[default1]: File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/webdataset/tariterators.py", line 246, in group_by_keys
[default1]: if handler(exn):
[default1]: ^^^^^^^^^^^^
[default1]: File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/webdataset/filters.py", line 86, in reraise_exception
[default1]: raise exn
[default1]: File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/webdataset/tariterators.py", line 239, in group_by_keys
[default1]: raise ValueError(
[default1]:ValueError: ("b1a6f690b8ae5ee4471299059767b3d6.0000.png: duplicate file name in tar file 0000.png dict_keys(['__key__', '__url__', '0000.png', '0001.png', '0002.png', '0003.png', '0004.png', '0005.png', '0006.png', '0007.png', '0008.png', '0009.png', '0010.png', '0011.png', '0012.png', '0013.png', '0014.png', '0015.png', '0016.png', '0017.png', '0018.png', '0019.png', '0020.png', '0021.png', '0022.png', '0023.png', 'cameras'])", None, None)
[default1]:
[2024-03-30 12:25:27,830] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3457767 closing signal SIGTERM
[2024-03-30 12:25:27,830] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3457769 closing signal SIGTERM
[2024-03-30 12:25:27,830] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3457770 closing signal SIGTERM
[2024-03-30 12:25:28,530] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 3457768) of binary: /PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/bin/python
Traceback (most recent call last):
File "/scratch/project_*/*/view-fusion/venv/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.2.0', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/PUHTI_TYKKY_rdjx0HO/miniconda/envs/env1/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
../main.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-03-30_12:25:27
host : r02g01.bullx
rank : 5 (local_rank: 1)
exitcode : 1 (pid: 3457768)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: r02g01: task 1: Exited with exit code 1
srun: Terminating StepId=21035875.0
slurmstepd: error: *** STEP 21035875.0 ON r01g01 CANCELLED AT 2024-03-30T12:25:28 ***
srun: error: r01g01: task 0: Terminated
srun: Force Terminated StepId=21035875.0