
Sf2 ci fail fix #225

Closed

wants to merge 12 commits into from

Conversation

wmFrank
Collaborator

@wmFrank wmFrank commented Nov 9, 2022

Auto-retry when tests fail.
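The auto-retry idea could be sketched as a generic wrapper around a test command (the function name, command, and retry count below are illustrative; the actual CI change in this PR may differ):

```python
import subprocess

def run_with_retries(cmd, max_attempts=3):
    """Hypothetical sketch of the PR's auto-retry idea: re-run a test
    command up to max_attempts times, returning the attempt number
    that succeeded."""
    for attempt in range(1, max_attempts + 1):
        # A zero exit code means the test run passed.
        if subprocess.run(cmd).returncode == 0:
            return attempt
    raise RuntimeError(f"command still failing after {max_attempts} attempts")

# Usage (illustrative): run_with_retries(["pytest", "tests/", "-x"])
```

Note that a retry wrapper like this only masks flakiness; it does not distinguish transient failures from legitimate bugs, which is the concern raised later in this thread.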

@codecov-commenter

codecov-commenter commented Nov 9, 2022

Codecov Report

Base: 80.53% // Head: 80.48% // Decreases project coverage by 0.05% ⚠️

Coverage data is based on head (ddb982b) compared to base (590c70d).
Patch has no changes to coverable lines.

Additional details and impacted files
@@            Coverage Diff             @@
##              sf2     #225      +/-   ##
==========================================
- Coverage   80.53%   80.48%   -0.06%     
==========================================
  Files          92       92              
  Lines        7368     7372       +4     
==========================================
- Hits         5934     5933       -1     
- Misses       1434     1439       +5     
Impacted Files                                      Coverage Δ
sample_factory/huggingface/huggingface_utils.py     16.94% <0.00%> (-1.24%) ⬇️
sample_factory/algo/learning/learner.py             87.85% <0.00%> (-0.16%) ⬇️


@alex-petrenko
Owner

https://pipelines.actions.githubusercontent.com/serviceHosts/7f2d7480-eacb-4e2d-8471-7637fec27dcc/_apis/pipelines/1/runs/1392/signedlogcontent/3?urlExpires=2022-11-13T00%3A24%3A46.0880017Z&urlSigningMethod=HMACV1&urlSignature=iSj4WeMV96aNfMlL02nTdnB3WT6yxuJYFN%2BcN9rxA%2BQ%3D

Last time, the tests failed again.
Take a look; you can see the error message at the end of the log file.

2022-11-12T06:31:03.2140390Z [2022-11-12 06:31:03,210][09733] EvtLoop [learner_proc0_evt_loop, process=learner_proc0] unhandled exception in slot='init' connected to emitter=Emitter(object_id='Runner_EvtLoop', signal_name='start'), args=()
2022-11-12T06:31:03.2141340Z Traceback (most recent call last):
2022-11-12T06:31:03.2142650Z   File "/usr/local/miniconda/lib/python3.8/site-packages/signal_slot/signal_slot.py", line 355, in _process_signal
2022-11-12T06:31:03.2143020Z     slot_callable(*args)
2022-11-12T06:31:03.2143580Z   File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/learning/learner_worker.py", line 139, in init
2022-11-12T06:31:03.2143930Z     init_model_data = self.learner.init()
2022-11-12T06:31:03.2144740Z   File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/learning/learner.py", line 214, in init
2022-11-12T06:31:03.2145340Z     self.actor_critic = create_actor_critic(self.cfg, self.env_info.obs_space, self.env_info.action_space)
2022-11-12T06:31:03.2146590Z   File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 296, in create_actor_critic
2022-11-12T06:31:03.2148110Z     return make_actor_critic_func(cfg, obs_space, action_space)
2022-11-12T06:31:03.2148850Z   File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 286, in default_make_actor_critic_func
2022-11-12T06:31:03.2149290Z     return ActorCriticSharedWeights(model_factory, obs_space, action_space, cfg)
2022-11-12T06:31:03.2149880Z   File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 141, in __init__
2022-11-12T06:31:03.2150870Z     self.encoder = model_factory.make_model_encoder_func(cfg, obs_space)
2022-11-12T06:31:03.2151960Z   File "/Users/runner/work/sample-factory/sample-factory/sf_examples/train_custom_env_custom_model.py", line 130, in make_custom_encoder
2022-11-12T06:31:03.2152360Z     return CustomEncoder(cfg, obs_space)
2022-11-12T06:31:03.2152940Z   File "/Users/runner/work/sample-factory/sample-factory/sf_examples/train_custom_env_custom_model.py", line 114, in __init__
2022-11-12T06:31:03.2153960Z     self.conv_head_out_size = calc_num_elements(self.conv_head, obs_shape)
2022-11-12T06:31:03.2155170Z   File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/utils/torch_utils.py", line 39, in calc_num_elements
2022-11-12T06:31:03.2155650Z     num_elements = module(some_input).numel()
2022-11-12T06:31:03.2156420Z   File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
2022-11-12T06:31:03.2157060Z     return forward_call(*input, **kwargs)
2022-11-12T06:31:03.2157780Z   File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
2022-11-12T06:31:03.2158340Z     input = module(input)
2022-11-12T06:31:03.2159160Z   File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
2022-11-12T06:31:03.2160390Z     return forward_call(*input, **kwargs)
2022-11-12T06:31:03.2161080Z   File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 463, in forward
2022-11-12T06:31:03.2161690Z     return self._conv_forward(input, self.weight, self.bias)
2022-11-12T06:31:03.2162270Z   File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
2022-11-12T06:31:03.2162610Z     return F.conv2d(input, weight, bias, self.stride,
2022-11-12T06:31:03.2162950Z RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [1, 27]
2022-11-12T06:31:03.2163630Z [2022-11-12 06:31:03,213][09733] Unhandled exception Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [1, 27] in evt loop learner_proc0_evt_loop

Looks like something is wrong with the observation shape: instead of an image, the convolutional encoder receives a vector?
This is a legitimate error, and we should properly fix it instead of retrying the test. So far I don't understand why it happens only infrequently.
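The symptom suggests the wrong encoder is constructed for the observation space. A minimal guard could route observations by shape (names are hypothetical; the actual sample-factory model factory is more involved):

```python
def choose_encoder(obs_shape):
    """Hypothetical illustration of the failure mode above: route
    3D (C, H, W) image observations to a conv head and 1D vector
    observations to an MLP. A (27,) vector shape matches the [1, 27]
    input that conv2d rejected in the log."""
    if len(obs_shape) == 3:
        return "conv"  # image-like observation
    if len(obs_shape) == 1:
        return "mlp"   # flat vector observation
    raise ValueError(f"unsupported observation shape: {obs_shape}")
```

If a vector observation ever reaches the conv branch anyway (as in the log), the shape mismatch surfaces only deep inside conv2d, which is why an early, explicit check like this fails faster and more readably.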

@wmFrank
Collaborator Author

wmFrank commented Nov 13, 2022

I cannot open the link; it seems to have expired. The failed test is test_example_sampler.py.

  1. I ran that test 100 times on my own MacBook, and it passed. I also ran it 100 times on GitHub Actions, and it passed there as well; see here: https://github.com/wmFrank/sample-factory/actions/runs/3454065762
  2. I am currently trying to reproduce the error.
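A repeated-run hunt like the one described above can be sketched as a small stress loop that stops at the first failing iteration (the command and run count are illustrative):

```python
import subprocess

def run_until_failure(cmd, runs=100):
    """Illustrative flaky-test hunt: repeat a test command many times,
    stopping at the first failure and reporting which iteration broke.
    Returns None if every run passed."""
    for i in range(1, runs + 1):
        if subprocess.run(cmd).returncode != 0:
            return i  # first failing iteration
    return None

# Usage (illustrative):
# run_until_failure(["pytest", "tests/examples/test_example_sampler.py", "-x"])
```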

@wmFrank
Collaborator Author

wmFrank commented Nov 14, 2022

These are the tests that have failed or hung:

tests/examples/test_example_multi.py
tests/algo/test_pbt.py
tests/envs/atari/test_atari.py
tests/envs/mujoco/test_mujoco.py

These are the types of errors that occurred in those hanging tests:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/miniconda/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/local/miniconda/lib/python3.8/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 322, in rebuild_storage_filename
    storage = torch.UntypedStorage._new_shared_filename_cpu(manager, handle, size)
RuntimeError: Connection refused


Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.8/site-packages/signal_slot/signal_slot.py", line 238, in __del__
    self.detach()
  File "/usr/local/miniconda/lib/python3.8/site-packages/signal_slot/signal_slot.py", line 233, in detach
    if self.event_loop:
AttributeError: 'EventLoopProcess' object has no attribute 'event_loop'
[W NNPACK.cpp:53] Could not initialize NNPACK! Reason: Unsupported hardware.


Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.8/site-packages/signal_slot/signal_slot.py", line 355, in _process_signal
    slot_callable(*args)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/learning/learner_worker.py", line 139, in init
    init_model_data = self.learner.init()
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/learning/learner.py", line 214, in init
    self.actor_critic = create_actor_critic(self.cfg, self.env_info.obs_space, self.env_info.action_space)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 296, in create_actor_critic
    return make_actor_critic_func(cfg, obs_space, action_space)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 286, in default_make_actor_critic_func
    return ActorCriticSharedWeights(model_factory, obs_space, action_space, cfg)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 141, in __init__
    self.encoder = model_factory.make_model_encoder_func(cfg, obs_space)
  File "/Users/runner/work/sample-factory/sample-factory/sf_examples/train_custom_env_custom_model.py", line 130, in make_custom_encoder
    return CustomEncoder(cfg, obs_space)
  File "/Users/runner/work/sample-factory/sample-factory/sf_examples/train_custom_env_custom_model.py", line 114, in __init__
    self.conv_head_out_size = calc_num_elements(self.conv_head, obs_shape)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/utils/torch_utils.py", line 39, in calc_num_elements
    num_elements = module(some_input).numel()
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [8, 1, 3, 3], expected input[1, 4, 84, 84] to have 1 channels, but got 4 channels instead


Traceback (most recent call last):
  File "/usr/local/miniconda/lib/python3.8/site-packages/signal_slot/signal_slot.py", line 355, in _process_signal
    slot_callable(*args)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/learning/learner_worker.py", line 139, in init
    init_model_data = self.learner.init()
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/learning/learner.py", line 214, in init
    self.actor_critic = create_actor_critic(self.cfg, self.env_info.obs_space, self.env_info.action_space)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 296, in create_actor_critic
    return make_actor_critic_func(cfg, obs_space, action_space)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 286, in default_make_actor_critic_func
    return ActorCriticSharedWeights(model_factory, obs_space, action_space, cfg)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/model/actor_critic.py", line 141, in __init__
    self.encoder = model_factory.make_model_encoder_func(cfg, obs_space)
  File "/Users/runner/work/sample-factory/sample-factory/sf_examples/train_custom_env_custom_model.py", line 130, in make_custom_encoder
    return CustomEncoder(cfg, obs_space)
  File "/Users/runner/work/sample-factory/sample-factory/sf_examples/train_custom_env_custom_model.py", line 114, in __init__
    self.conv_head_out_size = calc_num_elements(self.conv_head, obs_shape)
  File "/Users/runner/work/sample-factory/sample-factory/sample_factory/algo/utils/torch_utils.py", line 39, in calc_num_elements
    num_elements = module(some_input).numel()
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/miniconda/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Expected 3D (unbatched) or 4D (batched) input to conv2d, but got input of size: [1, 27]

I ran test_example_multi more than 100 times on GitHub Actions, and one error showed up:

Components take too long to start ... Aborting the experiment.
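The "take too long to start" abort implies a startup guard that waits for components with a deadline. A sketch of that pattern (the function, names, and timeout are assumptions, not the actual sample-factory runner code):

```python
import time

def wait_for_components(ready, timeout_s=30.0, poll_s=0.05):
    """Hypothetical sketch of the startup guard behind the log line:
    poll until the components report ready, or abort on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if ready():
            return True
        time.sleep(poll_s)
    raise TimeoutError("Components take too long to start ... Aborting the experiment.")
```

On slow or overloaded CI runners, such a fixed deadline can fire even when nothing is actually wrong, which would explain why this failure appears only intermittently.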

@wmFrank wmFrank closed this Nov 28, 2022