
Improve handling of empty ListFields. #2697

Merged: 24 commits merged into allenai:master on May 6, 2019

Conversation

brendan-ai2 (Member) commented Apr 6, 2019

  • There appears to be an edge case in our handling of empty ListFields such that we fail to tensorize a batch of entirely empty ones.
  • Proposed solution: Make TextField always return a minimum padding size of 1.
  • Fixes #2660
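
For concreteness, a minimal sketch of the edge case being described, assuming the 0.8-era AllenNLP API (the "mentions" field name is made up): every instance in the batch carries an empty ListField, so the computed padding lengths for that field are all zero.

```python
from allennlp.data import Instance, Vocabulary
from allennlp.data.dataset import Batch
from allennlp.data.fields import ListField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer

indexers = {"tokens": SingleIdTokenIndexer()}

def make_instance() -> Instance:
    # An "empty" ListField: its only element is an empty TextField.
    empty_text = TextField([], indexers).empty_field()
    return Instance({"mentions": ListField([empty_text])})

# Every instance in the batch has an empty ListField for "mentions".
instances = [make_instance(), make_instance()]
vocab = Vocabulary.from_instances(instances)
batch = Batch(instances)
batch.index_instances(vocab)

# Without a minimum padding size, the padding lengths here can all be zero,
# which either crashes during tensorization (torch 0.4.1) or yields a
# zero-element tensor that breaks downstream (torch 1.0).
tensors = batch.as_tensor_dict(batch.get_padding_lengths())
```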
brendan-ai2 (Member, Author) commented Apr 6, 2019

@DeNeutoy, could you sanity check this idea for me? Maybe adding the minimum padding of 1 makes more sense in the ListField itself? I'm really not sure this current approach makes sense.

Will add a test as soon as we're on the same page regarding the approach. (Linked bug does have a repro.)

matt-gardner (Member) commented Apr 6, 2019

I feel like we should either be returning None or throwing an error in this case, instead of this (maybe even an EmptyTensor error, to make it easily catchable). This is either a preprocessing error, or an optional Field. I can't think of a reason that you would actually want an empty tensor like this; having None seems like it would be strictly better.

brendan-ai2 (Member, Author) commented Apr 8, 2019

@matt-gardner, I need to rename this PR. We actually already try to support empty ListFields as evidenced by this comment. The current solution is not to generate an empty tensor, but to generate a tensor entirely of padding. This PR is simply trying to support the use case where a given batch of ListFields all happen to be empty by adding in a minimum amount of padding. Sorry if any of that is old news to you! Further advice appreciated.

brendan-ai2 changed the title from "Stab at handling empty ListFields." to "Improve handling of empty ListFields." on Apr 8, 2019
DeNeutoy (Collaborator) commented Apr 8, 2019

The distinction that Matt was trying to make is the difference between indexing an entire batch of input Instances, all of which have an empty field of type ListField[Field] versus the case where there exists at least one non-empty ListField[Field] in the batch.

In the first case, a better approach is to not construct the field in the first place, as it is completely empty for everything in the batch - instead, it should be an optional field in the model.

The comment you linked to is describing the second case, which was the initial motivation behind having the .empty_field() method, allowing nested padding of fields.

As an additional complication to this issue, the snippet @brendan-ai2 posted in #2660 has different behaviour in pytorch 0.4.1 vs 1.0.X (in 1.0 it doesn't break, because unsqueezing of empty tensors is allowed due to multi-dimensional empty tensors; see pytorch/pytorch#9947). This seems like a pretty gross change which is going to be hard to handle in both versions. Maybe it's time to require torch >= 1.0?
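
A quick way to see the version difference being referenced, using plain PyTorch (the 0.4.1 behaviour is as described above, not re-verified here):

```python
import torch

empty = torch.tensor([], dtype=torch.long)  # shape: (0,)

# Under torch >= 1.0 this succeeds and yields shape (1, 0); per the comment
# above, the same call fails under 0.4.1, where multi-dimensional empty
# tensors were not fully supported.
print(empty.unsqueeze(0).shape)
```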

brendan-ai2 (Member, Author) commented Apr 8, 2019

As a matter of software engineering it seems questionable to have two different representations for an object (the empty list) based on the other objects it happens to be in a collection with (the batch). Of course, this is unavoidable to some extent as padding length is determined by the batch, but this is going a step further.

Consider that if you're handling empty lists at all then you must have a suitable means of converting a list of all padding into something sensible for your model. (For when the batch isn't all empty.) What you and Matt are proposing, IIUC, would imply that you also need a way of converting None into something sensible for your model. Furthermore, it had better be exactly the same as converting a list of all padding. Isn't this just unnecessary work for our users?

I'm also concerned about conflating None and []. They aren't equivalent in Python or most other programming languages (Lisp being the notable exception), and treating them as interchangeable is a rather heavy-handed assumption to make. The user may need to distinguish between empty and "not even present".

I'm worried I'm missing something here, so please do correct any misconceptions. :)

brendan-ai2 (Member, Author) commented Apr 8, 2019

Switching over to 1.0.X may be slightly orthogonal here. Existing models that handle empty lists will work with the solution I proposed; it's not obvious to me that they'll work with multi-dimensional empty tensors.

matt-gardner (Member) commented Apr 8, 2019

I guess my concern is that I don't understand the use case for this, and I'm hesitant to recommend a solution to a problem I don't understand. I can't think of a reason that you would want a list of TextFields (or any other field) that is totally empty for a particular batch. Maybe I'm not being creative enough, but in all of the cases I can think of, encountering this indicates a bug in your code or a flaw in your model design, so exposing that to the user would be the right solution. If someone has a good use case where you actually want an empty list for the whole batch, I'm happy to change my opinion.

DeNeutoy (Collaborator) commented Apr 8, 2019

I think the point I'm trying to make is that if you ever need to write ListField(dummy_field.empty_field()) in your dataset reader, you are doing something wrong.

There are a couple of reasons an input could be optional to a model:

  1. The input is a label. In this case, passing an instance without labels is meaningless, as the loss won't be a function of this instance (in the standard, fully supervised setting), and during training it should be explicitly filtered out in your dataset reader.

  2. The input is actually an optional input. As we don't support batching of instances with heterogeneous fields, you'd need to use an iterator which explicitly handles datasets which have heterogeneous instances, such as this one.

A reasonable question in this case is "Why do we even need ListFields to have this empty_field() method, if calling it explicitly is an anti-pattern?"

The reason is that we do support nested ListFields. An example would be a QA system which takes a variable number (but strictly > 0) of facts, each with a variable number of sources. This would be represented as a tensor of shape (batch_size, num_facts, num_sources).

In the case that we receive two instances, one with a single fact, and one with two facts, we need to be able to pad a full dimension of a ListField, because the ListField is nested over two dimensions. In this case, the way we get the padding right for arbitrary sub-field types is to have this outer dimension be padded with a list of empty_fields, which is what the .empty_field method on ListField constructs for you (see here: https://github.com/allenai/allennlp/blob/master/allennlp/data/fields/list_field.py#L81)
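
A hedged sketch of that nested case, again assuming the 0.8-era API (the "facts"/"sources" structure and the toy sentences are illustrative, not from any real reader):

```python
from allennlp.data import Instance, Vocabulary
from allennlp.data.dataset import Batch
from allennlp.data.fields import ListField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

indexers = {"tokens": SingleIdTokenIndexer()}

def fact(text: str) -> ListField:
    # Each fact is itself a list of sources; here, a single source per fact.
    return ListField([TextField([Token(w) for w in text.split()], indexers)])

one_fact = Instance({"facts": ListField([fact("the sky is blue")])})
two_facts = Instance({"facts": ListField([fact("dogs bark"), fact("cats meow loudly")])})

instances = [one_fact, two_facts]
vocab = Vocabulary.from_instances(instances)
batch = Batch(instances)
batch.index_instances(vocab)

# The first instance is padded out to two facts; the extra fact comes from
# ListField.empty_field(), which is what makes the nested padding work.
tensors = batch.as_tensor_dict(batch.get_padding_lengths())
print(tensors["facts"]["tokens"].shape)  # (batch_size, num_facts, num_sources, num_tokens)
```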

brendan-ai2 (Member, Author) commented Apr 8, 2019

@matt-gardner, this can happen by chance, no? The user isn't intentionally creating a batch entirely of empty lists, but they do generate empty lists occasionally. If we happen to batch them together as-is, we'll crash. This is simply a bug that I would like to fix.

matt-gardner (Member) commented Apr 8, 2019

@brendan-ai2 - that doesn't really answer the question. In what real example would you actually encounter an entirely empty tensor? @DeNeutoy gave a good description of why the empty list is supported, and it's not because there's an empty tensor somewhere.

matt-gardner (Member) commented Apr 8, 2019

And to be clear, I agree that we should definitely do something about #2660, I'm just not sure if the fix should be to give a more helpful error message or to allow the behavior. In the absence of a compelling use case, it seems like we should do the former, because I think you only encounter this problem if you did something wrong.

brendan-ai2 (Member, Author) commented Apr 8, 2019

Sorry, Matt. Mark and I just spent a fair amount of time talking about this in person, as this seems somewhat delicate. One quick clarification that I think is redundant -- but just in case: I'm not proposing creating empty tensors, just handling empty lists more fully. The tensors themselves will be non-empty and consist of padding.

When Mark and I chatted I proposed two examples:

  1. For the salience work we have a dataset consisting of documents (titles and sentences) and pairs of entities. We extract mention spans for each entity from the title and sentences. For the title these are stored as ListField[SpanField]. See https://github.com/allenai/entity_salience/blob/master/entity_salience/dataset_readers/wikinews_salience.py#L116. When the entity doesn't occur in the title, this is an empty list. A typical batch would be a mix of empty lists and non-empty lists (for this particular field). Mark pointed out a reasonable alternative here for when we can assume that a span must exist in either the title or sentences (which is the case currently), but we actually have to filter out a number of instances from our dataset to achieve this. At some point I would like to experiment with some fuzzy matching which would get us back to square one: sometimes the mention spans would be empty.

  2. Consider the case of bot detection for a social network. Your input might consist of some scalar/string data like username, IP, geolocation, and email as well as a list of actions taken: posting content to a particular page, following some other user, etc. A new user would have no actions yet, an empty list. Older users, of course, would have many actions. This action list isn't optional for the model either -- an empty list could potentially be a strong signal. A placeholder value can be used in many cases (it's pretty natural if you're going to feed the list to a seq2vec), but it's harder to deal with in others. In the salience case above we already have code to disregard padding, so adding additional code to disregard a placeholder seems strange.

In short, my argument is that empty lists are natural. They occur alongside non-empty lists and thus using None is not suitable given our current batching.
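
A sketch of example 1 above (the function name and data are illustrative, not the actual salience reader): the title mention spans live in a ListField[SpanField] that is empty whenever the entity never appears in the title, so a typical batch mixes empty and non-empty lists for this field.

```python
from typing import List, Tuple
from allennlp.data.fields import ListField, SpanField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

indexers = {"tokens": SingleIdTokenIndexer()}

def title_mentions(title: str, spans: List[Tuple[int, int]]) -> ListField:
    title_field = TextField([Token(w) for w in title.split()], indexers)
    if spans:
        return ListField([SpanField(start, end, title_field) for start, end in spans])
    # Entity not mentioned in the title: an empty list of spans.
    return ListField([SpanField(-1, -1, title_field).empty_field()])

# A typical batch mixes empty and non-empty mention lists for this field.
with_mention = title_mentions("local team wins big game", [(1, 1)])
without_mention = title_mentions("markets rally after earnings report", [])
```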

brendan-ai2 (Member, Author) commented Apr 9, 2019

One of Mark's points was that processing padding naively can be questionable. If we extend the empty list proposal to empty strings (which I'm not suggesting, to be clear), we could take the example of the basic classifier. The embedded_text at https://github.com/allenai/allennlp/blob/master/allennlp/models/basic_classifier.py#L108 would be all zeros. This means we would end up simply outputting the bias of self._classification_layer to the softmax. I'd argue that a user can account for this (add an additional non-linear layer before the linear classification layer perhaps?), but Mark is rightfully pointing out that this could be an unhappy surprise for a user. This may be a case of a tradeoff between expressiveness and safer defaults.
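
A tiny self-contained illustration of that point in plain PyTorch (not the basic_classifier code itself): a linear classification layer applied to an all-zero encoding simply returns its bias.

```python
import torch

classification_layer = torch.nn.Linear(10, 3)
all_padding_encoding = torch.zeros(1, 10)  # e.g. embedded_text that is entirely padding
logits = classification_layer(all_padding_encoding)
assert torch.allclose(logits, classification_layer.bias.unsqueeze(0))
```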

matt-gardner (Member) commented Apr 9, 2019

@brendan-ai2, would you mind adding a simple test to this PR, preferably modeled after your salience example? I'm having a hard time visualizing what the model would get as input with your proposed fix, and thinking about how it should be handled by the model.

brendan-ai2 (Member, Author) commented Apr 10, 2019

@matt-gardner, sorry for the delay here. I'm close to having a proper end-to-end test. I'll ping when that's ready.

brendan-ai2 (Member, Author) commented Apr 12, 2019

@matt-gardner, please take a look at empty_list_test.py. I tried to strike a balance between avoiding a complicated test and avoiding giving you a contrived example. It's a bit on the long side, but should be much simpler now. Happy to elaborate on any points. Thank you for your help!

results = model.forward(**batch)["score"]
# For the sample data:
# {"body": "This is a test.", "entity_name": "exam", "entity_mentions": ["test", "quiz"]}
# {"body": "The dog went on a walk.", "entity_name": "animal", "entity_mentions": ["hound", "puppy"]}

matt-gardner (Member) commented Apr 12, 2019

Sorry, I didn't mean to make you do so much work for this simple thing! I think you're saying that this will fail (before this PR) if the entity mentions are empty for one instance? Is that right?

brendan-ai2 (Member, Author) commented Apr 25, 2019

Hey @matt-gardner, sorry for the delay on this. I've picked it back up and made some fixes. These tests behave as follows:

  • test_empty_list_can_be_tensorized: Broken on master with torch 0.4. Fixed in this PR.
  • test_end_to_end_broken_without_fix: Broken on master even with torch 1.0. Fixed in this PR.
  • test_end_to_end_works_in_master: Works in master, because at least one instance in the batch has non-empty mentions.

The trace for the test that would still be broken with torch 1.0 is:

allennlp/tests/end_to_end/empty_list_test.py:144: in forward
    embedded_mentions_tokens = self.embedder(entity_mentions)
../../../anaconda3/envs/allennlp/lib/python3.6/site-packages/torch/nn/modules/module.py:489: in __call__
    result = self.forward(*input, **kwargs)
allennlp/modules/text_field_embedders/basic_text_field_embedder.py:123: in forward
    token_vectors = embedder(*tensors)
../../../anaconda3/envs/allennlp/lib/python3.6/site-packages/torch/nn/modules/module.py:489: in __call__
    result = self.forward(*input, **kwargs)
allennlp/modules/token_embedders/embedding.py:133: in forward
    inputs = util.combine_initial_dims(inputs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

tensor = tensor([], size=(1, 1, 0), dtype=torch.int64)

    def combine_initial_dims(tensor: torch.Tensor) -> torch.Tensor:
        """
        Given a (possibly higher order) tensor of ids with shape
        (d1, ..., dn, sequence_length)
        Return a view that's (d1 * ... * dn, sequence_length).
        If original tensor is 1-d or 2-d, return it as is.
        """
        if tensor.dim() <= 2:
            return tensor
        else:
>           return tensor.view(-1, tensor.size(-1))
E           RuntimeError: cannot reshape tensor of 0 elements into shape [-1, 0]
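
For reference, the failure in the trace can be reproduced standalone, without any AllenNLP code: reshaping a zero-element tensor with an inferred dimension is ambiguous, so view(-1, 0) raises.

```python
import torch

ids = torch.empty((1, 1, 0), dtype=torch.int64)
try:
    ids.view(-1, ids.size(-1))
except RuntimeError as error:
    print(error)  # cannot reshape tensor of 0 elements into shape [-1, 0] ...
```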

matt-gardner (Member) commented Apr 25, 2019

Ok, thanks for this. To rephrase your use case: you have sparse inputs that are used as additional features for some prediction, and they are sparse enough that they can be empty for some cases. It would be silly to try to handle these as None in your model; it makes a whole lot more sense to just have a minimally-sized tensor that gets entirely masked and has no effect on the rest of the model.

I was thinking about possible alternatives for this; you could, e.g., try a MultiLabelField, but that's probably not a great idea, and definitely wouldn't work if your sparse inputs are sentences. I can't think of a better solution than what you have. Sorry for all of the hassle, I just didn't understand the use case well enough before seeing the code.
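
A small illustration of the "entirely masked" behaviour described here, assuming the 0.8-era get_text_field_mask and made-up shapes: padding token ids are 0, so the mask over an all-padding field is all zeros and the field contributes nothing downstream.

```python
import torch
from allennlp.nn.util import get_text_field_mask

# (batch_size, num_list_items, num_tokens) of pure padding.
all_padding = {"tokens": torch.zeros(2, 1, 4, dtype=torch.long)}
mask = get_text_field_mask(all_padding, num_wrapping_dims=1)
print(mask.sum().item())  # 0 -- every position is masked out
```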

matt-gardner (Member) commented Apr 25, 2019

Another alternative would be to try to explicitly label something as a Sparse field, or something, but that's probably more work than it's worth. Easier to just let the List field handle this.

brendan-ai2 (Member, Author) commented Apr 26, 2019

Yep, your rephrasing is spot on. :) And no worries about the hassle! Handling/using padding properly strikes me as an area where care is especially warranted.

brendan-ai2 (Member, Author) commented Apr 26, 2019

I've cleaned up some type/lint errors and removed the usage of Glove embeddings in the tests. Should be ready for a final pass.

brendan-ai2 (Member, Author) commented Apr 30, 2019

@matt-gardner / @DeNeutoy, could one of you please give this a final pass? Thank you.

matt-gardner assigned himself and unassigned DeNeutoy on May 1, 2019
@@ -0,0 +1,218 @@
# pylint: disable=no-self-use,invalid-name

matt-gardner (Member) commented May 1, 2019

I have pretty mixed feelings about having a separate end_to_end module for tests like this - shouldn't this just go in tests/data/fields/list_field_test.py? I understand why you wanted to do it this way, but if the classes you added here had already been in the library for an easy import, you certainly would have just added the test to the ListField file.

And as for having an end_to_end module in general - basically all of our models have "end to end" tests, and they live in tests/models/, so it's not clear what this is for.

brendan-ai2 (Member, Author) commented May 4, 2019

Moved to list_field_test.py as requested. This was mainly an artifact of a previous iteration of this PR where the fix was in text_field.py, though logically it was more related to ListField.

I think there's an argument to be made for keeping slower/broader integration-style tests separate from regular unit tests, but short of a project-wide cleanup I agree it's probably not terrifically useful.

instance.as_tensor_dict()

# A batch with entirely empty lists.
def test_end_to_end_broken_without_fix(self):

matt-gardner (Member) commented May 1, 2019

This test name will be totally meaningless in a few months once we've forgotten about this PR. Can you name it something more descriptive of what it's actually testing? Same with the test below. Also, I feel like the salience stuff was very helpful for getting me to understand what the motivation was for this, but I'm not sure it's necessary to have it in the test. I believe you just largely copied that code from your separate repo, anyway. What do you think of just removing it and having these two tests be similar to what you have in the test above, with a comment explaining in text an actual use case for this behavior (to make it easy, you could just copy my rephrasing)?

brendan-ai2 (Member, Author) commented May 4, 2019

Indeed. I should have cleaned this up previously. Sorry for missing that. I've simplified the tests and copied your rephrasing.

brendan-ai2 added 7 commits May 4, 2019
brendan-ai2 (Member, Author) left a comment

Feedback addressed. Please take another look.


brendan-ai2 added 2 commits May 4, 2019

matt-gardner (Member) left a comment

Thanks @brendan-ai2!

brendan-ai2 (Member, Author) commented May 6, 2019

Thanks for the review, @matt-gardner!

brendan-ai2 merged commit 63be47a into allenai:master on May 6, 2019

3 checks passed:
  • Pull Requests (AllenNLP Library): TeamCity build finished
  • codecov/patch: 100% of diff hit (target 90%)
  • codecov/project: 92% (+<1%) compared to a701a0a
reiyw added a commit to reiyw/allennlp that referenced this pull request Nov 12, 2019