
python run.py with data_root="/arrows_flickr30k" num_gpus=1 num_nodes=1 task_finetune_irtr_f30k_randaug per_gpu_batchsize=4 load_path="vilt_200k_mlm_itm.ckpt" #6

jkkishore1999 opened this issue May 27, 2021 · 15 comments


@jkkishore1999

I am encountering this error:

WARNING - root - Changed type of config entry "max_steps" from int to NoneType
WARNING - ViLT - No observers have been added to this run
INFO - ViLT - Running command 'main'
INFO - ViLT - Started
Global seed set to 0
INFO - lightning - Global seed set to 0
ERROR - ViLT - Failed after 0:00:12!
Traceback (most recent calls WITHOUT Sacred internals):
File "run.py", line 17, in main
model = ViLTransformerSS(_config)
File "/others/cs16b114/ViLT/vilt/modules/vilt_module.py", line 61, in __init__
ckpt = torch.load(self.hparams.config["load_path"], map_location="cpu")
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 527, in load
with _open_zipfile_reader(f) as opened_zipfile:
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 224, in __init__
super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: version
<= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /pytorch/caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at /pytorch/caffe2/serialize/inline_container.cc:132)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fb11cc1a193 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1f5b (0x7fafcd29cafb in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x64 (0x7fafcd29dd14 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch.so)
frame #3: + 0x6c6296 (0x7fb0ad870296 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #4: + 0x2957d4 (0x7fb0ad43f7d4 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)

frame #6: python() [0x4d8067]
frame #8: python() [0x58f850]
frame #10: python() [0x54aa51]
frame #14: python() [0x58f6a3]
frame #19: python() [0x54a880]
frame #23: python() [0x58f6a3]
frame #30: python() [0x5db3e4]
frame #32: + 0x431b (0x7fb0e0b9531b in /usr/local/lib/python3.7/dist-packages/wrapt/_wrappers.cpython-37m-x86_64-linux-gnu.so)
frame #37: python() [0x59412b]
frame #42: python() [0x54a880]
frame #51: python() [0x6308e2]
frame #54: python() [0x65450e]
frame #56: __libc_start_main + 0xe7 (0x7fb12285abf7 in /lib/x86_64-linux-gnu/libc.so.6)

@dandelin
Owner

As the error message says, "Your PyTorch installation may be too old" (init at /pytorch/caffe2/serialize/inline_container.cc:132), it's highly likely that your PyTorch and PyTorch Lightning versions don't match ours.

Please install the latest PyTorch version.

@jkkishore1999
Author

Can you please let me know the recommended PyTorch and PyTorch Lightning versions?
I have already run:
pip install -r requirements.txt
pip install -e .

but the error above still occurs.

Do we need to modify requirements.txt?

@dandelin
Owner

@jkkishore1999
PyTorch is not in requirements.txt, so you are using whichever PyTorch version you installed yourself.
PyTorch > 1.7 should work fine.
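For reference, a minimal sketch of checking that requirement. In practice you would pass `torch.__version__`; the version string below is only an example value, not read from a real install:

```python
def version_tuple(v: str) -> tuple:
    """Parse a version string like '1.8.1+cu111' into comparable integers."""
    return tuple(int(p) for p in v.split("+")[0].split(".")[:3])

installed = "1.8.1+cu111"  # stand-in for torch.__version__
if version_tuple(installed) < (1, 7, 0):
    raise RuntimeError("PyTorch >= 1.7 is required to read this checkpoint format")
```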

@jkkishore1999
Author

My PyTorch version is 1.8, but there are still other errors:

python run.py with data_root=/data2/dsets/dataset num_gpus=1 num_nodes=1 task_finetune_irtr_f30k_randaug per_gpu_batchsize=4 load_path="weights/vilt_200k_mlm_itm.ckpt"

WARNING - root - Changed type of config entry "max_steps" from int to NoneType
WARNING - ViLT - No observers have been added to this run
INFO - ViLT - Running command 'main'
INFO - ViLT - Started
Global seed set to 0
INFO - lightning - Global seed set to 0
GPU available: True, used: True
INFO - lightning - GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO - lightning - TPU available: None, using: 0 TPU cores
Using environment variable NODE_RANK for node rank ().
INFO - lightning - Using environment variable NODE_RANK for node rank ().
ERROR - ViLT - Failed after 0:00:05!
Traceback (most recent call last):
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 312, in run_commandline
return self.run(
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 276, in run
run()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/run.py", line 238, in __call__
self.result = self.main_function(*args)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/config/captured_function.py", line 42, in captured_function
result = wrapped(*args, **kwargs)
File "run.py", line 48, in main
trainer = pl.Trainer(
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 41, in overwrite_by_env_vars
return fn(self, **kwargs)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 347, in __init__
self.accelerator_connector.on_trainer_init(
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 127, in on_trainer_init
self.trainer.node_rank = self.determine_ddp_node_rank()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 415, in determine_ddp_node_rank
return int(rank)
ValueError: invalid literal for int() with base 10: ''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run.py", line 11, in <module>
def main(_config):
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 190, in automain
self.run_commandline()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 347, in run_commandline
print_filtered_stacktrace()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 493, in print_filtered_stacktrace
print(format_filtered_stacktrace(filter_traceback), file=sys.stderr)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 528, in format_filtered_stacktrace
return "".join(filtered_traceback_format(tb_exception))
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 568, in filtered_traceback_format
current_tb = tb_exception.exc_traceback
AttributeError: 'TracebackException' object has no attribute 'exc_traceback'

Can you please help?

@jkkishore1999
Author

Also, sometimes the same command produces a different error:

WARNING - root - Changed type of config entry "max_steps" from int to NoneType
WARNING - ViLT - No observers have been added to this run
INFO - ViLT - Running command 'main'
INFO - ViLT - Started
Global seed set to 0
INFO - lightning - Global seed set to 0
GPU available: True, used: True
INFO - lightning - GPU available: True, used: True
TPU available: None, using: 0 TPU cores
INFO - lightning - TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
INFO - lightning - LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Using native 16bit precision.
INFO - lightning - Using native 16bit precision.
Missing logger folder: result/finetune_irtr_f30k_randaug_seed0_from_vilt_200k_mlm_itm
WARNING - lightning - Missing logger folder: result/finetune_irtr_f30k_randaug_seed0_from_vilt_200k_mlm_itm
Global seed set to 0
INFO - lightning - Global seed set to 0
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
INFO - lightning - initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
INFO - root - Added key: store_based_barrier_key:1 to store for rank: 0
ERROR - ViLT - Failed after 0:00:05!
Traceback (most recent call last):
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 312, in run_commandline
return self.run(
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 276, in run
run()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/run.py", line 238, in __call__
self.result = self.main_function(*args)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/config/captured_function.py", line 42, in captured_function
result = wrapped(*args, **kwargs)
File "run.py", line 71, in main
trainer.fit(model, datamodule=dm)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 473, in fit
results = self.accelerator_backend.train()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 152, in train
results = self.ddp_train(process_idx=self.task_idx, model=model)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 268, in ddp_train
self.trainer.call_setup_hook(model)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 859, in call_setup_hook
self.datamodule.setup(stage_name)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/core/datamodule.py", line 92, in wrapped_fn
return fn(*args, **kwargs)
File "/others/cs16b114/ViLT/vilt/datamodules/multitask_datamodule.py", line 34, in setup
dm.setup(stage)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/core/datamodule.py", line 92, in wrapped_fn
return fn(*args, **kwargs)
File "/others/cs16b114/ViLT/vilt/datamodules/datamodule_base.py", line 137, in setup
self.set_train_dataset()
File "/others/cs16b114/ViLT/vilt/datamodules/datamodule_base.py", line 76, in set_train_dataset
self.train_dataset = self.dataset_cls(
File "/others/cs16b114/ViLT/vilt/datasets/f30k_caption_karpathy_dataset.py", line 15, in __init__
super().__init__(*args, **kwargs, names=names, text_column_name="caption")
File "/others/cs16b114/ViLT/vilt/datasets/base_dataset.py", line 53, in __init__
self.table_names += [name] * len(tables[i])
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run.py", line 11, in <module>
def main(_config):
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 190, in automain
self.run_commandline()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 347, in run_commandline
print_filtered_stacktrace()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 493, in print_filtered_stacktrace
print(format_filtered_stacktrace(filter_traceback), file=sys.stderr)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 528, in format_filtered_stacktrace
return "".join(filtered_traceback_format(tb_exception))
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 568, in filtered_traceback_format
current_tb = tb_exception.exc_traceback
AttributeError: 'TracebackException' object has no attribute 'exc_traceback'

@dandelin
Owner

  1. export MASTER_ADDR=$DIST_0_IP
    export MASTER_PORT=$DIST_0_PORT
    export NODE_RANK=$DIST_RANK
Please check you've set the above environment variables.
  2. is your data located in data_root=/data2/dsets/dataset or data_root=/arrows_flickr30k?

@jkkishore1999
Author

Thanks for the help.

  1. Even after running the three export commands above, those three environment variables are still set to empty strings:
    declare -x MASTER_ADDR=""
    declare -x MASTER_PORT=""
    declare -x NODE_RANK=""
  2. I have changed my data_root to /data2/dsets/dataset.
    Can you please help?

@dandelin
Owner

dandelin commented May 28, 2021

packages/pytorch_lightning/accelerators/accelerator_connector.py", line 415, in determine_ddp_node_rank
return int(rank)
ValueError: invalid literal for int() with base 10: ''

The error above seems to be caused by DDP not being properly initialized.

Do you mean the export commands did not change the environment variables?
Setting those environment variables is necessary for PyTorch Lightning to run DDP training properly.
Please make sure those variables are set (you can check the current environment variables with the env command).
Also, check out this guide.
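Illustratively, the failing code path boils down to reading NODE_RANK from the environment and calling int() on it, so a variable that is exported but empty still crashes. This is a sketch of the assumed Lightning behavior, not the actual library source:

```python
import os

def determine_node_rank(default: int = 0) -> int:
    """Mimic the assumed Lightning lookup: an empty NODE_RANK raises ValueError."""
    rank = os.environ.get("NODE_RANK")
    if rank is None:
        return default  # an unset variable falls back to the default...
    return int(rank)    # ...but int("") raises ValueError: invalid literal for int()

os.environ["NODE_RANK"] = "0"  # export a real value before launching training
assert determine_node_rank() == 0
```

This is why `declare -x NODE_RANK=""` (an exported empty string) fails while either a real value or a fully unset variable would not.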

File "/others/cs16b114/ViLT/vilt/datasets/base_dataset.py", line 53, in __init__
self.table_names += [name] * len(tables[i])
IndexError: list index out of range

Also, this error was probably raised because the list tables is empty.
Please check in advance that each dataset file exists at f"{data_dir}/{name}.arrow".
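A small sketch of that check. The split names below are illustrative; see vilt/datasets for the actual names the f30k task loads:

```python
from pathlib import Path

def find_missing_arrows(data_dir, names):
    """Return the dataset names whose .arrow file is absent under data_dir."""
    root = Path(data_dir)
    return [n for n in names if not (root / f"{n}.arrow").exists()]

# Illustrative split names for the Flickr30k finetuning task:
names = [
    "f30k_caption_karpathy_train",
    "f30k_caption_karpathy_val",
    "f30k_caption_karpathy_test",
]
print(find_missing_arrows("/data2/dsets/dataset", names))
```

If this prints any names, base_dataset.py will see an empty tables list for them and raise the IndexError above.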

@hongzhenwang

1. Make the Arrow files. Conversion scripts are located in vilt/utils/write_*.py; run the make_arrow functions to convert the dataset to pyarrow binary files.
2. export MASTER_ADDR="0.0.0.0"
export MASTER_PORT="8000"
export NODE_RANK=0

@Miazzzzx

Have you solved your problem? I've encountered the same one. If possible, could you please tell me how to solve it?

@631212502

Have you solved it? I have the same bug. I have checked the data location and the environment variables, but print(tables) always comes back empty. Maybe there is something wrong with how I set the path. Can you tell me where the data should be placed so that data_root=/data2/dsets/dataset works directly?

@631212502

Have you solved it? I have the same bug. I have checked the data location and the environment variables, but print(tables) always comes back empty. Maybe there is something wrong with how I set the path. Can you tell me where the data should be placed so that data_root=/data2/dsets/dataset works directly?

I have found the reason: the path was missing the quotation marks ("").

@ThompsonISAT

Have you solved it? I have the same bug. I have checked the data location and the environment variables, but print(tables) always comes back empty. Maybe there is something wrong with how I set the path. Can you tell me where the data should be placed so that data_root=/data2/dsets/dataset works directly?

I have found the reason: the path was missing the quotation marks ("").

Hi, even after adding the quotation marks around the path, I still get the same error. Could you help me fix it? Thank you so much!

@XX1nn

XX1nn commented Apr 11, 2023

@jkkishore1999 I also noticed that you are using num_gpus=1 num_nodes=1. I use the same parameters, and now I get the same error as you.

File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/pytorch_lightning/core/datamodule.py", line 92, in wrapped_fn
return fn(*args, **kwargs)
File "/others/cs16b114/ViLT/vilt/datamodules/datamodule_base.py", line 137, in setup
self.set_train_dataset()
File "/others/cs16b114/ViLT/vilt/datamodules/datamodule_base.py", line 76, in set_train_dataset
self.train_dataset = self.dataset_cls(
File "/others/cs16b114/ViLT/vilt/datasets/f30k_caption_karpathy_dataset.py", line 15, in __init__
super().__init__(*args, **kwargs, names=names, text_column_name="caption")
File "/others/cs16b114/ViLT/vilt/datasets/base_dataset.py", line 53, in __init__
self.table_names += [name] * len(tables[i])
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "run.py", line 11, in <module>
def main(_config):
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 190, in automain
self.run_commandline()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/experiment.py", line 347, in run_commandline
print_filtered_stacktrace()
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 493, in print_filtered_stacktrace
print(format_filtered_stacktrace(filter_traceback), file=sys.stderr)
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 528, in format_filtered_stacktrace
return "".join(filtered_traceback_format(tb_exception))
File "/others/cs16b114/anaconda3/envs/vilt/lib/python3.8/site-packages/sacred/utils.py", line 568, in filtered_traceback_format
current_tb = tb_exception.exc_traceback
AttributeError: 'TracebackException' object has no attribute 'exc_traceback'

May I know how this was resolved in the end? Is the fix to first set the environment variables and then check whether the files exist? Also, since I am running non-distributed training, how should I set the environment variables (MASTER_ADDR, MASTER_PORT, NODE_RANK)?

@XX1nn

XX1nn commented Apr 11, 2023

1. Make the Arrow files. Conversion scripts are located in vilt/utils/write_*.py; run the make_arrow functions to convert the dataset to pyarrow binary files. 2. export MASTER_ADDR="0.0.0.0" export MASTER_PORT="8000" export NODE_RANK=0

@hongzhenwang
Thanks for your answer. Are those values meant for non-distributed runs? Does 0.0.0.0 work with any IP? Can I just use MASTER_ADDR="0.0.0.0" and NODE_RANK=0 for my non-distributed finetuning?
