
[Performance][Optimizer] Enable using UVA and FP16 with SparseAdamOptimizer #3885

Merged
merged 27 commits into dmlc:master from uva_embedding
Jun 24, 2022

Conversation

nv-dlasalle
Collaborator

@nv-dlasalle nv-dlasalle commented Mar 25, 2022

Description

This PR enables transferring optimizer states via UVA (by default), as well as storing them in FP16 (requires opt-in). While there are many factors, combining both of these optimizations improves performance in the backward pass by about 2x and significantly cuts down on memory usage (both from FP16, and from not needing to allocate buffer tensors to copy to and from the GPU).

This depends on #3997 to ensure the UVA arrays get properly freed when the optimizer is destroyed.
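
To illustrate the FP16 opt-in, here is a minimal NumPy sketch (names and shapes are illustrative, not DGL's actual implementation) of a sparse Adam step that stores the moment tensors in FP16 but computes the update in FP32, which is what halves the state memory without degrading the update arithmetic:

```python
import numpy as np

def sparse_adam_step(params, m, v, idx, grad, step, lr=1e-3,
                     betas=(0.9, 0.999), eps=1e-8):
    """Update only the rows in `idx`; moments m/v may be stored in fp16."""
    b1, b2 = betas
    # Read the touched state rows, computing in fp32 regardless of storage dtype.
    m_rows = m[idx].astype(np.float32) * b1 + (1 - b1) * grad
    v_rows = v[idx].astype(np.float32) * b2 + (1 - b2) * grad ** 2
    m[idx] = m_rows.astype(m.dtype)  # write back in the storage dtype
    v[idx] = v_rows.astype(v.dtype)
    # Bias-corrected estimates, then the parameter update, all in fp32.
    m_hat = m_rows / (1 - b1 ** step)
    v_hat = v_rows / (1 - b2 ** step)
    params[idx] -= lr * m_hat / (np.sqrt(v_hat) + eps)

rows, dim = 1000, 16
params = np.zeros((rows, dim), dtype=np.float32)
# Opt-in fp16 state: half the bytes of the equivalent fp32 moments.
m = np.zeros((rows, dim), dtype=np.float16)
v = np.zeros((rows, dim), dtype=np.float16)
idx = np.array([3, 7])
grad = np.ones((2, dim), dtype=np.float32)
sparse_adam_step(params, m, v, idx, grad, step=1)
```

Only the rows touched by `idx` are read, updated, and written back, which is what makes the optimizer "sparse"; the fp16 storage simply changes the dtype of the write-back.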

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change

Changes

  • Adds a "Scatter" series of UVA functions for scattering data from the GPU into pinned CPU memory.
  • Adds parameters use_uva and dtype to the SparseAdamOptimizer.
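
The new "Scatter" kernels are the inverse of the existing IndexSelect (gather) path: rows computed on the GPU are written directly into a pinned host buffer at given indices. A minimal NumPy sketch of the indexing semantics (the names here are illustrative, not DGL's API; in the PR, `dest` is pinned CPU memory and `src` lives on the GPU, so the writes cross the bus via UVA instead of going through a staging buffer):

```python
import numpy as np

def scatter_rows(dest, idx, src):
    """Write src's rows into dest at positions idx, i.e. dest[idx[k]] = src[k]."""
    assert len(idx) == len(src)
    dest[idx] = src

state = np.zeros((8, 4), dtype=np.float32)               # stands in for pinned host state
updates = np.arange(8, dtype=np.float32).reshape(2, 4)   # stands in for GPU-side results
scatter_rows(state, np.array([5, 2]), updates)
```

Because the destination rows are addressed directly, no intermediate copy of the full state tensor is needed, which is where the memory savings in the description come from.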

@dgl-bot
Collaborator

dgl-bot commented Mar 25, 2022

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

Review threads (all resolved) on: python/dgl/optim/pytorch/sparse_optim.py, src/array/cuda/uvm/array_index_select_uvm.cu, tests/pytorch/test_optim.py
@classicsong classicsong self-requested a review May 5, 2022 01:28
@classicsong
Contributor

LGTM


@jermainewang
Member

The test_unified_tensor UT failed:

tests/pytorch/test_unified_tensor.py::test_unified_tensor FAILED         [ 99%]
tests/pytorch/test_unified_tensor.py::test_multi_gpu_unified_tensor[1] PASSED [ 99%]
tests/pytorch/test_unified_tensor.py::test_multi_gpu_unified_tensor[2] SKIPPED [100%]

=================================== FAILURES ===================================
_____________________________ test_unified_tensor ______________________________

    @unittest.skipIf(os.name == 'nt', reason='Do not support windows yet')
    @unittest.skipIf(F.ctx().type == 'cpu', reason='gpu only test')
    def test_unified_tensor():
        test_row_size = 65536
        test_col_size = 128
    
        rand_test_size = 8192
    
        input = th.rand((test_row_size, test_col_size))
        input_unified = dgl.contrib.UnifiedTensor(input, device=th.device('cuda'))
    
        seq_idx = th.arange(0, test_row_size)
        assert th.all(th.eq(input[seq_idx], input_unified[seq_idx]))
    
        seq_idx = seq_idx.to(th.device('cuda'))
        assert th.all(th.eq(input[seq_idx].to(th.device('cuda')), input_unified[seq_idx]))
    
        rand_idx = th.randint(0, test_row_size, (rand_test_size,))
        assert th.all(th.eq(input[rand_idx], input_unified[rand_idx]))
    
>       rand_idx = rand_idx.to(th.device('cuda'))
E       RuntimeError: CUDA error: invalid argument
E       CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
E       For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

tests/pytorch/test_unified_tensor.py:36: RuntimeError

I saw you've changed the IndexSelect operator, so this is probably related.

Review thread (resolved) on: python/dgl/optim/pytorch/sparse_optim.py

@yaox12
Collaborator

yaox12 commented Jun 23, 2022

@nv-dlasalle If you don't have anything to add, I think it's OK to merge this PR.


@yaox12 yaox12 force-pushed the uva_embedding branch 2 times, most recently from 44b4c98 to b16bb40 Compare June 23, 2022 14:24

@dgl-bot
Collaborator

dgl-bot commented Jun 23, 2022

Commit ID: b16bb40

Build ID: 22

Status: ✅ CI test succeeded

Report path: link

Full logs path: link

@yaox12 yaox12 merged commit 020f024 into dmlc:master Jun 24, 2022