
[Suggestion] Add parameter mini_width or something else to TokenCharactersIndexer's initializer. #1954

Closed
WrRan opened this issue Oct 24, 2018 · 2 comments


@WrRan
Contributor

WrRan commented Oct 24, 2018

Describe the bug
I used a simple CNN on a text classification task. It worked well during training, but broke during prediction.

To Reproduce
For convenience, I provide a test case at allennlp-demo-producing-bug.
Steps to reproduce the behavior:

  1. Clone the repository to local dir: git clone https://github.com/WrRan/allennlp-demo-producing-bug.git demo
  2. Go to demo: cd demo
  3. Begin training:
export PYTHONPATH=`pwd` && export CUDA_VISIBLE_DEVICES=0
allennlp train config/cnn_classifier.json -s data/model/cnn --include-package tc
  4. Begin predicting:
allennlp predict data/model/cnn/model.tar.gz data/data.txt --include-package tc --predictor sentiment_classifier --use-dataset-reader
  5. See error:
Traceback (most recent call last):
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/run.py", line 20, in <module>
    main(prog="allennlp")
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/commands/__init__.py", line 70, in main
    args.func(args)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/commands/predict.py", line 192, in _predict
    manager.run()
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/commands/predict.py", line 168, in run
    for result in self._predict_instances(batch):
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/commands/predict.py", line 137, in _predict_instances
    results = [self._predictor.predict_instance(batch_data[0])]
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/predictors/predictor.py", line 53, in predict_instance
    outputs = self._model.forward_on_instance(instance)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/models/model.py", line 124, in forward_on_instance
    return self.forward_on_instances([instance])[0]
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/models/model.py", line 155, in forward_on_instances
    outputs = self.decode(self(**model_input))
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user_data/wangr/ws/demo/tc/cnn_classifier.py", line 58, in forward
    embedded_text = self._dropout(self.text_field_embedder(text))
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 88, in forward
    token_vectors = embedder(*tensors)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/modules/token_embedders/token_characters_encoder.py", line 36, in forward
    return self._dropout(self._encoder(self._embedding(token_characters), mask))
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/modules/time_distributed.py", line 35, in forward
    reshaped_outputs = self._module(*reshaped_inputs)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/allennlp/modules/seq2vec_encoders/cnn_encoder.py", line 106, in forward
    self._activation(convolution_layer(tokens)).max(dim=2)[0]
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user_data/anaconda3/envs/docqa/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 176, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (1 x 2). Kernel size: (1 x 3). Kernel size can't greater than actual input size at ...

Expected behavior
Because the test data is the same as the training data, it is unexpected to get a RuntimeError.
(Moreover, the predictor used in allennlp predict is trivial.)

System (please complete the following information):

  • OS: Linux
  • Python version: 3.6.1
  • AllenNLP version: 0.6.1
  • PyTorch version: 0.4.0

Additional context
I think the error comes from TokenCharactersIndexer under certain edge cases with the default settings.
The error is produced by the combination of the settings and a special datapoint:
The cnn_encoder with the setting:

          "ngram_filter_sizes": [
            2,3,4
          ],

requires every word to consist of at least 4 characters.

configs: https://github.com/WrRan/allennlp-demo-producing-bug/blob/685dcffd3755da60fe3fb360aa6f8338571ce86d/config/cnn_classifier.json#L49-L51

However, there is an unexpected datapoint, I I I I I I I., whose longest token is only a single character.
Thus, prediction breaks.
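
For illustration, here is a minimal PyTorch-only sketch (independent of AllenNLP) of the underlying failure: a Conv1d kernel cannot be wider than its input.

import torch

# A character CNN in miniature: embedding size 8, one filter of kernel size 4.
conv = torch.nn.Conv1d(in_channels=8, out_channels=16, kernel_size=4)

long_word = torch.randn(1, 8, 7)   # a word padded to 7 characters
short_word = torch.randn(1, 8, 2)  # a word padded to only 2 characters

conv(long_word)   # works
conv(short_word)  # RuntimeError: kernel size can't be greater than actual input size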

But the training process works well.
Why?
Because batch_size is set to 64, the sentence I I I I I I I. is fed to the model in the same batch as another, well-behaved sentence, The allennlp is awesome.. In that case, class TokenCharactersIndexer produces a padding_lengths value of 7 (the length of awesome, the longest word in the batch).
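
A minimal sketch of this batching effect (the sentences and the padding logic here are illustrative, not AllenNLP's actual code):

# Character padding width is the maximum token length over whatever
# instances happen to be padded together.
bad  = ["I", "I", "I", "I", "I", "I", "I", "."]  # longest token: 1 character
good = ["this", "model", "is", "awesome"]        # longest token: 7 characters

def char_padding_length(batch):
    return max(len(token) for sentence in batch for token in sentence)

print(char_padding_length([bad, good]))  # 7: wide enough for kernel size 4
print(char_padding_length([bad]))        # 1: narrower than the kernel, so Conv1d fails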

Suggestions
I think the key is the design of class TokenCharactersIndexer, which computes padding_lengths from only a single datapoint (in its method get_padding_lengths).
So it may be a good idea to add a parameter mini_width, or something similar, to TokenCharactersIndexer's initializer.
We could then write a config like this:

    "token_indexers": {
      "token_characters": {
        "type": "characters",
        "mini_width": 4
      }
    }

and the class TokenCharactersIndexer may work like this:

from typing import Dict, List

from allennlp.data.token_indexers.token_indexer import TokenIndexer
from allennlp.data.tokenizers.character_tokenizer import CharacterTokenizer


class TokenCharactersIndexer(TokenIndexer[List[int]]):
    def __init__(self,
                 mini_width: int,
                 namespace: str = 'token_characters',
                 character_tokenizer: CharacterTokenizer = CharacterTokenizer()) -> None:
        self._mini_width = mini_width
        self._namespace = namespace
        self._character_tokenizer = character_tokenizer

    def get_padding_lengths(self, token: List[int]) -> Dict[str, int]:
        # Pad every token to at least mini_width characters, so the widest
        # CNN filter always has enough input to convolve over.
        return {"num_token_characters": max(len(token), self._mini_width)}

# other stuff
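
For example, under this sketch a one-character token (already converted to character ids) would still be padded to width 4:

indexer = TokenCharactersIndexer(mini_width=4)
print(indexer.get_padding_lengths([42]))  # {'num_token_characters': 4}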

The solution above resolves my issue. And I think giving mini_width no default value is a better idea.
Does this make sense?

I look forward to hearing from you. Thanks.

@matt-gardner
Contributor

Yes, we've hit this issue multiple times, and I think adding a min_width parameter where you suggest is a good idea. PR welcome. I'd call it min_width or min_num_characters, not mini_width, and give it a default value of 0.

@WrRan
Contributor Author

WrRan commented Oct 25, 2018

I am working on it.
I think min_padding_length is a better name.
And it may be better to provide no default value for this parameter:
requiring an explicit value forces users to check that min_padding_length matches the maximum value of ngram_filter_sizes (in cnn_encoder). Otherwise, this subtle issue may be hit again after a "normal" training process. A sketch of such a check is below.
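
For instance, a hypothetical sanity check (not part of AllenNLP; the function name is mine) that users would then be prompted to think about:

def check_char_cnn_config(min_padding_length: int, ngram_filter_sizes: list) -> None:
    # Hypothetical check: padding must cover the widest CNN filter, or
    # Conv1d can see inputs narrower than its kernel at predict time.
    if min_padding_length < max(ngram_filter_sizes):
        raise ValueError(
            f"min_padding_length ({min_padding_length}) must be >= "
            f"the largest ngram filter size ({max(ngram_filter_sizes)})"
        )

check_char_cnn_config(min_padding_length=4, ngram_filter_sizes=[2, 3, 4])  # passes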
What do you think? @matt-gardner

matt-gardner pushed a commit that referenced this issue Oct 26, 2018
* add min_padding_length to TokenCharactersIndexer (#1954)

* mv min_padding_length to arg-list's end

* add test_case for character_token_indexer_test
 * test min_padding_length

* add test_case for character_token_indexer_test
 * test min_padding_length

* delete annoying DOS crlf

* delete unused variable

* Remove trailing newlines