
Extend GPU fixes from vidurj and MaxMotovilov #1944

Merged
merged 59 commits into allenai:master on Oct 26, 2018

Conversation

@brendan-ai2 (Member) commented Oct 24, 2018

  • Builds on #1724 and #1445 in order to fix #1439.
  • Differences:
    • Use a scatter analogous to the PyTorch implementation.
    • Identify metadata that needs to be scattered with a dummy ScatterableList subclass.
    • Dispatch to PyTorch's underlying tensor scatter to precisely match behavior.
  • Minor fixes to parallelize Event2Mind model. (Used to help develop the rest.)
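The marker-subclass idea described above can be sketched as follows. This is an illustrative, torch-free sketch, not the PR's actual code; the `scatter_value` helper and its chunking policy are assumptions made for the example, while `ScatterableList` is the name used in the PR:

```python
# Minimal sketch of the marker-subclass idea (illustrative only).
# A ScatterableList is just a list that signals "split me across devices"
# to the scatter logic; every other value is replicated on each device.

class ScatterableList(list):
    """Marker subclass: metadata batched into this type should be scattered."""

def scatter_value(value, num_devices):
    # Hypothetical helper: split scatterable values into per-device chunks,
    # replicate everything else unchanged on every device.
    if isinstance(value, ScatterableList):
        chunk = (len(value) + num_devices - 1) // num_devices
        return [value[i:i + chunk] for i in range(0, len(value), chunk)]
    return [value] * num_devices

metadata = ScatterableList(["a", "b", "c", "d"])
print(scatter_value(metadata, 2))   # [['a', 'b'], ['c', 'd']]
print(scatter_value({"k": 1}, 2))   # [{'k': 1}, {'k': 1}]
```

Because the marker is a plain `list` subclass, code that merely iterates over batched metadata is unaffected; only the scatter path inspects the type.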
MaxMotovilov and others added 30 commits Jun 29, 2018
@brendan-ai2 (Member, Author) commented Oct 24, 2018

@vidurj, this isn't 100% ready for review yet, but what do you think of the basic concept? It should be robust and it's also very simple.

@@ -61,7 +62,7 @@ def empty_field(self) -> 'MetadataField':
@classmethod
@overrides
def batch_tensors(cls, tensor_list: List[DataArray]) -> DataArray: # type: ignore
-        return tensor_list # type: ignore
+        return ScatterableList(tensor_list) # type: ignore

@matt-gardner (Member) commented Oct 24, 2018
The ProductionRuleField needs a similar modification.

@brendan-ai2 (Member, Author) commented Oct 25, 2018

Good catch, thanks! This does make me wonder if we should modify the type that batch_tensors is supposed to return to include the possibility of using ScatterableList. But historically we've just been ignoring types here. My intuition for this in Python is not particularly developed.

@brendan-ai2 (Member, Author) commented Oct 25, 2018

@matt-gardner, would you mind reviewing this actually? I've cleaned it up a bit. Only remaining bit of work should be fixing the licensing. Thanks!

@matt-gardner (Member) left a comment

LGTM. What was the issue with just putting smaller batches on separate GPUs to start with?

def to_pointer_tensor(self) -> torch.LongTensor:
"""
Converts the elements to pointers via ``id``, casts them to ``int64``, and returns them in a tensor. The cast is
important because ``id`` returns unsigned integers while ``torch.LongTensor`` is signed.

@matt-gardner (Member) commented Oct 25, 2018

Ouch. That must have been annoying to find...

@brendan-ai2 (Member, Author) commented Oct 25, 2018

Fortunately this didn't manifest as a mysterious bug. C/C++ programming has left me with a fair degree of caution around this sort of thing.
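The signed/unsigned round-trip under discussion can be demonstrated in isolation. This is a standalone sketch using only `ctypes`; the explicit `c_int64` cast stands in for what storing the pointer in a `torch.LongTensor` would do:

```python
import ctypes

# Stash a Python object's address, squeeze it through a *signed* 64-bit
# value (as torch.LongTensor storage would), and recover the object.
obj = {"metadata": 42}

unsigned_ptr = id(obj)                             # id() is unsigned
signed_ptr = ctypes.c_int64(unsigned_ptr).value    # what a LongTensor stores
restored_ptr = ctypes.c_uint64(signed_ptr).value   # undo the sign wrap

# CAUTION: only valid while `obj` is still alive; a garbage-collected
# address would make the cast below undefined behavior.
recovered = ctypes.cast(restored_ptr, ctypes.py_object).value
assert recovered is obj
```

The round-trip is exact because both casts are reductions modulo 2**64, which is why the docstring insists on treating the signedness explicitly rather than hoping the values stay small.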

return torch.LongTensor(pointers)

@classmethod
def from_pointer_tensor(cls, pointers) -> list:

@matt-gardner (Member) commented Oct 25, 2018

Type annotation on pointers. It's a List[torch.LongTensor], where the tensor should be a single scalar? Or is it a single tensor of shape (list_length,) that you're iterating over?

@brendan-ai2 (Member, Author) commented Oct 25, 2018

A single tensor of shape (list_length,). Type and comment to that effect added.

"""
return [cast(c_uint64(pointer.item()).value, py_object).value for pointer in pointers]

# TODO(brendanr): Add licensing stuff for borrowing this from torch before distributing, i.e. pushing to master.

@matt-gardner (Member) commented Oct 25, 2018

Is it sufficient to just point to the license in the docstring, next to where you point to the github repo?

@brendan-ai2 (Member, Author) commented Oct 25, 2018

@schmmd suggested the same thing. Sounds good enough to me.

@@ -86,12 +88,33 @@ def test_trainer_can_run_cuda(self):
cuda_device=0)
trainer.train()

# pylint: disable=arguments-differ

@matt-gardner (Member) commented Oct 25, 2018

Is this for the call to forward below? I'd say this belongs inside that scope, if that's what this is for.

@brendan-ai2 (Member, Author) commented Oct 25, 2018

Oops! Fixed. This line escaped during some experimentation with the linter.

@brendan-ai2 (Member, Author) left a comment

Thanks for the review!


@brendan-ai2 brendan-ai2 merged commit 2a8bd63 into allenai:master Oct 26, 2018
2 of 3 checks passed
  • codecov/patch: 45% of diff hit (target 90%)
  • Pull Requests (AllenNLP Library): TeamCity build finished
  • codecov/project: 92% (-1%) compared to be1ff9f
@brendan-ai2 (Member, Author) commented Oct 26, 2018

Note: I've bypassed the patch code coverage check because the testing for this PR relies on having multiple GPUs available. See https://github.com/allenai/allennlp/blob/master/allennlp/tests/training/trainer_test.py#L91. We should probably have a story for this sort of integration testing.

brendan-ai2 added a commit that referenced this pull request Oct 29, 2018
- I noticed this difference while retraining the model to account for #1944.
- With this change we generate the exact same vocab as was used for the 10/5 model and now the new 10/26 model.
- In the long run we'll want to be more principled with our vocab. In the short term I'm more concerned with ensuring that we can retrain consistently.
@brendan-ai2 (Member, Author) commented Oct 29, 2018

@matt-gardner, sorry, somehow I missed your question about the smaller batches! Vidur asked about that and Matt Peters' main concern was as follows:

> The biggest issue I see is with something like the bucket iterator. We want the work for each GPU to be approximately equal, but the bucket iterator will yield consecutive batches with very different sequence lengths. For training LMs, I use bucket-iterator-like batching to fill each batch with the same number of tokens (batch size * sequence length is about constant).

(I'm not entirely sure I grok the issue myself. Probably will be more clear once I have something running end-to-end. For now the solution in this PR should suffice, but we can certainly revisit later.)
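The concern can be made concrete with a toy example. This is pure illustration, not AllenNLP code; the sequence lengths are invented. A bucket iterator sorts instances by length, so consecutive fixed-size batches given to different GPUs can represent very different amounts of padded work:

```python
# Toy illustration of the load-imbalance concern (not AllenNLP code).
# A bucket iterator sorts by length, so consecutive fixed-size batches
# have very different total (padded) token counts.
lengths = [5, 7, 9, 12, 40, 55, 80, 120]   # hypothetical sorted sequence lengths
batch_size = 2

batches = [lengths[i:i + batch_size] for i in range(0, len(lengths), batch_size)]
work = [max(b) * len(b) for b in batches]  # padded tokens per batch
print(work)   # [14, 24, 110, 240]: GPUs fed consecutive batches do very unequal work
```

Scattering a single large batch across GPUs, as this PR does, sidesteps the issue because each GPU gets a slice of the same (roughly homogeneous) bucketed batch.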

@matt-gardner (Member) commented Oct 29, 2018

Yeah, ok, that makes sense. Thanks.

grad_norm = params.pop_float("grad_norm", None)
grad_clipping = params.pop_float("grad_clipping", None)
lr_scheduler_params = params.pop("learning_rate_scheduler", None)

if cuda_device >= 0:
model = model.cuda(cuda_device)

@bryant1410 (Contributor) commented Dec 15, 2018

Why is this change okay? It seems I'm having errors because of this right now with v0.7.2.

@brendan-ai2 (Member, Author) commented Dec 16, 2018

We call self.model.cuda in the constructor, so I believe this is redundant. Could you please provide more detail? A stack trace would be ideal. Thank you.

@matt-gardner (Member) commented Dec 16, 2018

This is the cause of #2182. These lines need to stay. Sorry I didn't catch this earlier. The issue is that the optimizer constructs its internal state based on the device the model is on; for some optimizers, like Adagrad, that internal state contains tensors that need to be on the right device. That's why these lines are here: the call to model.cuda() has to happen before the optimizer is created. #2190 fixes this.
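The ordering constraint can be shown with a torch-free toy model of optimizer state. All class names here are invented for the illustration; the point is only that state is pinned to each parameter's device at optimizer construction time, which real optimizers such as Adagrad also do:

```python
# Torch-free sketch of why model.cuda() must run before the optimizer is
# built: optimizer state is allocated on whatever device each parameter
# occupies at construction time.

class Param:
    def __init__(self):
        self.device = "cpu"

class Model:
    def __init__(self):
        self.params = [Param(), Param()]
    def cuda(self, device=0):
        for p in self.params:
            p.device = f"cuda:{device}"
        return self

class ToyAdagrad:
    def __init__(self, params):
        # State "tensors" pinned to each parameter's device *right now*.
        self.state_devices = [p.device for p in params]

# Wrong order: optimizer state stays on the CPU after the model moves.
model = Model()
opt = ToyAdagrad(model.params)
model.cuda(0)
assert opt.state_devices == ["cpu", "cpu"]          # mismatches the cuda:0 params

# Right order: move the model first, then construct the optimizer.
model2 = Model().cuda(0)
opt2 = ToyAdagrad(model2.params)
assert opt2.state_devices == ["cuda:0", "cuda:0"]
```

In the wrong-order case, the first optimizer step mixes CPU state with GPU parameters, which is exactly the device-mismatch error reported in #2182.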

@matt-gardner (Member) commented Dec 16, 2018

On these lines, is there any way we can make our CI at least occasionally run GPU tests? That way we could actually have a regression test for this (we've had exactly this error before, which is why these lines were originally added).

@bryant1410 (Contributor) commented Dec 16, 2018

👍

Maybe also adding a comment would be good then.
