This repository has been archived by the owner on Mar 19, 2024. It is now read-only.

[WIP] Fixes for release #437

Closed
wants to merge 4 commits into from

@iseessel
Contributor

iseessel commented Oct 5, 2021

  1. Add the appropriate PyTorch/CUDA versions for building apex in conda_apex and conda_vissl.
  2. Separate out integration_tests.sh, because it caused the unit tests to be repeated in the apex builds.
  3. Make #in_temporary_directory exception-safe. Previously, when a test using it failed, all subsequent tests would fail with:
======================================================================
ERROR: test_restart_after_preemption_at_epoch (test_state_checkpointing.TestStateCheckpointing)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/private/home/iseessel/conda-bld/vissl_1633020455653/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/utils/test_utils.py", line 86, in wrapped_test
    return test_function(*args, **kwargs)
  File "/private/home/iseessel/conda-bld/vissl_1633020455653/test_tmp/tests/test_state_checkpointing.py", line 80, in test_restart_after_preemption_at_epoch
    with in_temporary_directory():
  File "/private/home/iseessel/conda-bld/vissl_1633020455653/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/contextlib.py", line 81, in _enter_
    return next(self.gen)
  File "/private/home/iseessel/conda-bld/vissl_1633020455653/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/utils/test_utils.py", line 29, in in_temporary_directory
    old_cwd = os.getcwd()
FileNotFoundError: [Errno 2] No such file or directory
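
For illustration, here is a minimal sketch of what an exception-safe version of such a helper could look like. It is not the actual vissl implementation; it assumes the helper is a generator-based context manager (as the traceback suggests) and that the cascade of failures came from the process being left inside a temporary directory that had already been deleted, which would make the next os.getcwd() call raise FileNotFoundError.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def in_temporary_directory():
    """Run the body of the `with` block inside a fresh temporary directory."""
    old_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as temp_dir:
        os.chdir(temp_dir)
        try:
            yield temp_dir
        finally:
            # Restore the working directory even if the body raised, so the
            # process is never left inside a directory that is about to be
            # deleted by TemporaryDirectory's cleanup.
            os.chdir(old_cwd)


if __name__ == "__main__":
    start = os.getcwd()
    try:
        with in_temporary_directory():
            raise RuntimeError("simulated test failure")
    except RuntimeError:
        pass
    # The failure inside the block did not leave us in a deleted directory.
    assert os.getcwd() == start
```

The try/finally around the yield is the important part: the cleanup runs whether the test body returns normally or raises.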
  4. Destroy the process group after each test in test_tasks.py. After building, conda runs the unit tests for vissl, and the process group created by the first test is reused by all later tests. Because the GPU tests run first, the group is initialized with the nccl backend and stays on it for the rest of the run. One of the tests requires the gloo backend, since it calls all_gather on CPU tensors. We don't see this problem on circle-ci because the tests are split out there. The specific error is:
ERROR: test_run_0_config_test_cpu_test_test_cpu_regnet_moco_yaml (test_tasks.TaskTest)
Instantiate and run all the test tasks [with config_file_path='config=test/cpu_test/test_cpu_regnet_moco.yaml']
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/parameterized/parameterized.py", line 533, in standalone_func
    return func(*(a + p.args), **p.kwargs)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/test_tmp/tests/test_tasks.py", line 50, in test_run
    hook_generator=default_hook_generator,
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/engines/train.py", line 130, in train_main
    trainer.train()
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/trainer/trainer_main.py", line 201, in train
    raise e
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/trainer/trainer_main.py", line 193, in train
    task = train_step_fn(task)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/trainer/train_steps/standard_train_step.py", line 158, in standard_train_step
    local_loss = task.loss(model_output, target)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/losses/moco_loss.py", line 152, in forward
    self._dequeue_and_enqueue(self.key)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/losses/moco_loss.py", line 89, in _dequeue_and_enqueue
    keys = concat_all_gather(key)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/vissl/utils/misc.py", line 230, in concat_all_gather
    torch.distributed.all_gather(tensors_gather, tensor, async_op=False)
  File "/private/home/iseessel/conda-bld/vissl_1633035886061/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1863, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: Tensors must be CUDA and dense

----------------------------------------------------------------------
Ran 1455 tests in 2664.959s

FAILED (errors=1)
Tests failed for vissl-0.1.5-py36.tar.bz2 - moving package to /private/home/iseessel/conda-bld/broken
WARNING:conda_build.build:Tests failed for vissl-0.1.5-py36.tar.bz2 - moving package to /private/home/iseessel/conda-bld/broken
WARNING conda_build.build:tests_failed(2955): Tests failed for vissl-0.1.5-py36.tar.bz2 - moving package to /private/home/iseessel/conda-bld/broken
TESTS FAILED: vissl-0.1.5-py36.tar.bz2
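
As a rough sketch of the idea (not the actual change to test_tasks.py), a test case can tear down the default process group in tearDown so that the next test initializes its own group with whatever backend it needs. The class name, the helper name, and the single-process gloo setup with a file-based rendezvous are assumptions made so that the example runs on a CPU-only machine.

```python
import os
import tempfile
import unittest

import torch
import torch.distributed as dist


def destroy_process_group_if_initialized():
    """Tear down the default process group, if one exists (hypothetical helper)."""
    if dist.is_available() and dist.is_initialized():
        dist.destroy_process_group()


class ExampleDistributedTest(unittest.TestCase):  # hypothetical test case
    def tearDown(self):
        # Without this, the next test would inherit this test's backend.
        destroy_process_group_if_initialized()

    def test_all_gather_on_cpu(self):
        # Assumption: a one-process gloo group with a file-based rendezvous.
        # The real tests let the trainer choose the backend and init method.
        init_file = os.path.join(tempfile.mkdtemp(), "rendezvous")
        dist.init_process_group(
            backend="gloo",
            init_method=f"file://{init_file}",
            rank=0,
            world_size=1,
        )
        tensor = torch.ones(2)
        gathered = [torch.zeros(2)]
        # all_gather on CPU tensors works with gloo; with an nccl group left
        # over from an earlier GPU test it fails as in the traceback above.
        dist.all_gather(gathered, tensor)
        self.assertTrue(torch.equal(gathered[0], tensor))
```

Guarding on dist.is_initialized() keeps the teardown safe for tests that never create a process group.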
  5. Use a specific commit of fairscale, as per the circle-ci documentation.

@iseessel iseessel requested review from QuentinDuval and prigoyal and removed request for QuentinDuval October 5, 2021 17:23
@facebook-github-bot
Contributor

@iseessel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@prigoyal
Contributor

prigoyal commented Oct 5, 2021

looks good to me! let's wait for the tests to pass and we can merge it. Thank you so much!

@facebook-github-bot
Contributor

@iseessel has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot added the "CLA Signed" label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Oct 5, 2021
@iseessel changed the title from "Fixes for release" to "[WIP] Fixes for release" on Oct 18, 2021

@iseessel
Contributor Author

Closing in favor of smaller separate PRs.

@iseessel iseessel closed this Oct 19, 2021