
[Suggestion] Add parameter mini_width or something else to TokenCharactersIndexer's initializer. #1954

Closed
WrRan opened this issue Oct 24, 2018 · 2 comments


@WrRan
Contributor

WrRan commented Oct 24, 2018

Describe the bug
I used a simple CNN on a text classification task. It worked well during training, but broke during prediction.

To Reproduce
For convenience, I provide a test case at allennlp-demo-producing-bug.
Steps to reproduce the behavior:

  1. Clone the repository to local dir: git clone https://github.com/WrRan/allennlp-demo-producing-bug.git demo
  2. Go to demo: cd demo
  3. Begin training:
export PYTHONPATH=`pwd` && export CUDA_VISIBLE_DEVICES=0
allennlp train config/cnn_classifier.json -s data/model/cnn --include-package tc
  4. Begin predicting:
allennlp predict data/model/cnn/model.tar.gz data/data.txt --include-package tc --predictor sentiment_classifier --use-dataset-reader
  5. See error:
Traceback (most recent call last):
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/run.py", line 20, in <module>
    main(prog="allennlp")
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 70, in main
    args.func(args)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/commands/predict.py", line 192, in _predict
    manager.run()
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/commands/predict.py", line 168, in run
    for result in self._predict_instances(batch):
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/commands/predict.py", line 137, in _predict_instances
    results = [self._predictor.predict_instance(batch_data[0])]
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/predictors/predictor.py", line 53, in predict_instance
    outputs = self._model.forward_on_instance(instance)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/models/model.py", line 124, in forward_on_instance
    return self.forward_on_instances([instance])[0]
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/models/model.py", line 155, in forward_on_instances
    outputs = self.decode(self(**model_input))
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user_data/wangr/ws/demo/tc/cnn_classifier.py", line 58, in forward
    embedded_text = self._dropout(self.text_field_embedder(text))
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 88, in forward
    token_vectors = embedder(*tensors)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/modules/token_embedders/token_characters_encoder.py", line 36, in forward
    return self._dropout(self._encoder(self._embedding(token_characters), mask))
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/modules/time_distributed.py", line 35, in forward
    reshaped_outputs = self._module(*reshaped_inputs)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/modules/seq2vec_encoders/cnn_encoder.py", line 106, in forward
    self._activation(convolution_layer(tokens)).max(dim=2)[0]
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 176, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (1 x 2). Kernel size: (1 x 3). Kernel size can't greater than actual input size at ...

Expected behavior
Because the test data is the same as the training data, it is unexpected to get a RuntimeError.
(Moreover, the predictor used in allennlp predict is trivial.)

System (please complete the following information):

  • OS: Linux
  • Python version: 3.6.1
  • AllenNLP version: 0.6.1
  • PyTorch version: 0.4.0

Additional context
I think the error comes from TokenCharactersIndexer under certain edge cases with the default settings.
The error is produced by the combination of the settings and a special datapoint:
The cnn_encoder with the setting:

          "ngram_filter_sizes": [
            2,3,4
          ],

requires every word to consist of at least 4 characters.

configs: https://github.com/WrRan/allennlp-demo-producing-bug/blob/685dcffd3755da60fe3fb360aa6f8338571ce86d/config/cnn_classifier.json#L49-L51

However, there is an unexpected datapoint, I I I I I I I., whose longest token is only a single character.
Thus, prediction breaks.
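
For illustration, here is a minimal PyTorch-only sketch (independent of AllenNLP) of the underlying failure: a Conv1d kernel cannot be wider than its input.

import torch

# A character CNN in miniature: embedding size 8, one filter of kernel size 4.
conv = torch.nn.Conv1d(in_channels=8, out_channels=16, kernel_size=4)

long_word = torch.randn(1, 8, 7)   # a word padded to 7 characters
short_word = torch.randn(1, 8, 2)  # a word padded to only 2 characters

conv(long_word)   # works
conv(short_word)  # RuntimeError: kernel size can't be greater than actual input size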

But the training process works well.
Why?
Because batch_size is set to 64, the sentence I I I I I I I. is fed to the model in the same batch as another, well-behaved sentence, The allennlp is awesome.. In that case, class TokenCharactersIndexer produces a padding_lengths value of 7 (the length of awesome, the longest word in the batch).
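
A minimal sketch of this batching effect (the sentences and the padding logic here are illustrative, not AllenNLP's actual code):

# Character padding width is the maximum token length over whatever
# instances happen to be padded together.
bad  = ["I", "I", "I", "I", "I", "I", "I", "."]  # longest token: 1 character
good = ["this", "model", "is", "awesome"]        # longest token: 7 characters

def char_padding_length(batch):
    return max(len(token) for sentence in batch for token in sentence)

print(char_padding_length([bad, good]))  # 7: wide enough for kernel size 4
print(char_padding_length([bad]))        # 1: narrower than the kernel, so Conv1d fails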

Suggestions
I think the key is the design of class TokenCharactersIndexer, which computes padding_lengths from only a single datapoint (in its method get_padding_lengths).
So it may be a good idea to add a parameter mini_width, or something similar, to TokenCharactersIndexer's initializer.
We could then write a config like this:

    "token_indexers": {
      "token_characters": {
        "type": "characters",
        "mini_width": 4
      }
    }

and the class TokenCharactersIndexer may work like this:

from typing import Dict, List

from allennlp.data.token_indexers.token_indexer import TokenIndexer
from allennlp.data.tokenizers.character_tokenizer import CharacterTokenizer


class TokenCharactersIndexer(TokenIndexer[List[int]]):
    def __init__(self,
                 mini_width: int,
                 namespace: str = 'token_characters',
                 character_tokenizer: CharacterTokenizer = CharacterTokenizer()) -> None:
        self._mini_width = mini_width
        self._namespace = namespace
        self._character_tokenizer = character_tokenizer

    def get_padding_lengths(self, token: List[int]) -> Dict[str, int]:
        # Pad every token to at least mini_width characters, so the widest
        # CNN filter always has enough input to convolve over.
        return {"num_token_characters": max(len(token), self._mini_width)}

# other stuff
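
For example, under this sketch a one-character token (already converted to character ids) would still be padded to width 4:

indexer = TokenCharactersIndexer(mini_width=4)
print(indexer.get_padding_lengths([42]))  # {'num_token_characters': 4}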

The solution above resolves my issue. And I think giving mini_width no default value is a better idea.
Does this make sense?

I look forward to hearing from you. Thanks.

@matt-gardner
Contributor

Yes, we've hit this issue multiple times, and I think adding a min_width parameter where you suggest is a good idea. PR welcome. I'd call it min_width or min_num_characters, not mini_width, and give it a default value of 0.

@WrRan
Contributor Author

WrRan commented Oct 25, 2018

I am working on it.
I think min_padding_length is a better name.
And it may be better to provide no default value for this parameter:
requiring an explicit value forces users to check that min_padding_length matches the maximum value of ngram_filter_sizes (in cnn_encoder). Otherwise, this subtle issue may be hit again after a "normal" training process. A sketch of such a check is below.
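
For instance, a hypothetical sanity check (not part of AllenNLP; the function name is mine) that users would then be prompted to think about:

def check_char_cnn_config(min_padding_length: int, ngram_filter_sizes: list) -> None:
    # Hypothetical check: padding must cover the widest CNN filter, or
    # Conv1d can see inputs narrower than its kernel at predict time.
    if min_padding_length < max(ngram_filter_sizes):
        raise ValueError(
            f"min_padding_length ({min_padding_length}) must be >= "
            f"the largest ngram filter size ({max(ngram_filter_sizes)})"
        )

check_char_cnn_config(min_padding_length=4, ngram_filter_sizes=[2, 3, 4])  # passes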
What do you think? @matt-gardner

matt-gardner pushed a commit that referenced this issue Oct 26, 2018
* add min_padding_length to TokenCharactersIndexer (#1954)

* mv min_padding_length to arg-list's end

* add test_case for character_token_indexer_test
 * test min_padding_length

* add test_case for character_token_indexer_test
 * test min_padding_length

* delete annoying DOS crlf

* delete unused variable

* Remove trailing newlines