
OverflowError: int too big to convert #2210

Closed
gunturbudi opened this issue Apr 15, 2021 · 7 comments · Fixed by #2502
Labels
wontfix This will not be worked on

Comments

@gunturbudi

gunturbudi commented Apr 15, 2021

Hello,
I'm trying to train a named entity recognition model with this embedding: TransformerWordEmbeddings('emilyalsentzer/Bio_ClinicalBERT'). However, it always fails with OverflowError: int too big to convert. The same thing happens with some other transformer word embeddings, such as XLNet, while BERT and RoBERTa work fine.
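
For reference, the training setup looks roughly like this (the corpus loading and paths below are simplified placeholders; the embedding and trainer parameters match the run shown in the log):

from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# placeholder corpus: the real data is a custom clinical NER corpus in column format
corpus = ColumnCorpus('data/', {0: 'text', 1: 'ner'})
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')

# the embedding that triggers the error
embeddings = TransformerWordEmbeddings('emilyalsentzer/Bio_ClinicalBERT')

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
)

trainer = ModelTrainer(tagger, corpus)
trainer.train(
    'xxx-clinical-bert',           # placeholder output path
    learning_rate=0.1,
    mini_batch_size=32,
    max_epochs=200,
    embeddings_storage_mode='gpu',
)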

Here is the full traceback of the error:

2021-04-15 09:34:48,106 ----------------------------------------------------------------------------------------------------
2021-04-15 09:34:48,106 Corpus: "Corpus: 778 train + 259 dev + 260 test sentences"
2021-04-15 09:34:48,106 ----------------------------------------------------------------------------------------------------
2021-04-15 09:34:48,106 Parameters:
2021-04-15 09:34:48,106  - learning_rate: "0.1"
2021-04-15 09:34:48,106  - mini_batch_size: "32"
2021-04-15 09:34:48,106  - patience: "3"
2021-04-15 09:34:48,106  - anneal_factor: "0.5"
2021-04-15 09:34:48,106  - max_epochs: "200"
2021-04-15 09:34:48,106  - shuffle: "True"
2021-04-15 09:34:48,106  - train_with_dev: "False"
2021-04-15 09:34:48,106  - batch_growth_annealing: "False"
2021-04-15 09:34:48,107 ----------------------------------------------------------------------------------------------------
2021-04-15 09:34:48,107 Model training base path: "/home/xxx/data/xxx-clinical-bert"
2021-04-15 09:34:48,107 ----------------------------------------------------------------------------------------------------
2021-04-15 09:34:48,107 Device: cuda:0
2021-04-15 09:34:48,107 ----------------------------------------------------------------------------------------------------
2021-04-15 09:34:48,107 Embeddings storage mode: gpu
2021-04-15 09:34:48,116 ----------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "train_medical_2.py", line 144, in <module>
    train_ner(d + '-base-ent',corpus_base)
  File "train_medical_2.py", line 136, in train_ner
    max_epochs=200)
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/flair/trainers/trainer.py", line 381, in train
    loss = self.model.forward_loss(batch_step)
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/flair/models/sequence_tagger_model.py", line 637, in forward_loss
    features = self.forward(data_points)
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/flair/models/sequence_tagger_model.py", line 642, in forward
    self.embeddings.embed(sentences)
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/flair/embeddings/token.py", line 81, in embed
    embedding.embed(sentences)
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/flair/embeddings/base.py", line 60, in embed
    self._add_embeddings_internal(sentences)
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/flair/embeddings/token.py", line 923, in _add_embeddings_internal
    self._add_embeddings_to_sentence(sentence)
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/flair/embeddings/token.py", line 999, in _add_embeddings_to_sentence
    truncation=True,
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 2438, in encode_plus
    **kwargs,
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/transformers/tokenization_utils_fast.py", line 472, in _encode_plus
    **kwargs,
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/transformers/tokenization_utils_fast.py", line 379, in _batch_encode_plus
    pad_to_multiple_of=pad_to_multiple_of,
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/transformers/tokenization_utils_fast.py", line 330, in set_truncation_and_padding
    self._tokenizer.enable_truncation(max_length, stride=stride, strategy=truncation_strategy.value)
OverflowError: int too big to convert

I have tried changing the embedding_storage_mode, hidden_size, and mini_batch_size, but none of these fixed the issue.

Does anyone have the same issue?
Is there any way to resolve this?

Thanks

@schelv
Contributor

schelv commented Apr 16, 2021

I think I've seen something like this before.
In my case it happened because the max_length property (or something with a similar name) was not set correctly on the transformers model/tokenizer, so it fell back to a very large number, which can produce exactly the error you are getting.

You could check if this property is set correctly on the tokenizer and/or the model objects that are used inside the TransformerWordEmbeddings. According to this it should be 128.
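
A quick way to check that, assuming a reasonably recent transformers version where the attribute is called model_max_length:

from transformers import AutoTokenizer

# load the same tokenizer Flair builds internally and inspect its maximum length
tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
print(tokenizer.model_max_length)

If this prints a huge sentinel value (on the order of 1e30) instead of something like 128 or 512, then no maximum length is configured, and passing that number to the Rust tokenizer's enable_truncation is exactly what raises the OverflowError in your traceback. On the Flair side the tokenizer should also be reachable as embedding.tokenizer on the TransformerWordEmbeddings object, if I remember correctly.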

@alanakbik
Collaborator

Hi, I believe this was fixed in PR #2191 by @tiagokv, at least for XLNet. Could you check if you also see this error on the master branch?
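
If you don't have a source checkout at hand, installing the current master branch directly via pip should also work, e.g. pip install git+https://github.com/flairNLP/flair.git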

@muellerzr

@alanakbik can confirm it fixed one case for me where I was getting this issue!

@esmarruffo

Hello!

It might be too late, but just in case anyone has the same issue: you just need to add the model class to the NO_MAX_SEQ_LENGTH_MODELS list found at the beginning of the TransformerWordEmbeddings class declaration.

Here:

class TransformerWordEmbeddings(TokenEmbeddings):

    NO_MAX_SEQ_LENGTH_MODELS = [XLNetModel, TransfoXLModel, BertModel]  # <-- BertModel added here

    def __init__(

As you can see, I've added the BERT model class, since it was causing the same error for me. Remember to import it from the Hugging Face 🤗 transformers library first.
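
For example, at the top of flair/embeddings/token.py (assuming BertModel is not already imported there):

from transformers import BertModel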

It looks like the fix from @tiagokv's PR is needed for the other transformer models besides XLNet as well, so that TransformerWordEmbeddings always works properly.

@seyyaw
Contributor

seyyaw commented May 11, 2021

(quotes the workaround from the previous comment)

@S-glitch I have a similar issue when using TransformerDocumentEmbeddings with a RoBERTa model (a custom one). How did you solve the issue for TransformerWordEmbeddings? Did you modify the source? Thanks

@esmarruffo

esmarruffo commented May 11, 2021

Hi @seyyaw,

Yes, you need to have a modifiable installation of the Flair framework and edit the source. I think the easiest way would be to just clone the repository and use it directly.
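
Something along these lines should do it (a standard editable install from a clone; adjust paths as needed):

git clone https://github.com/flairNLP/flair.git
cd flair
pip install -e .

After that, any edits you make under flair/ take effect directly.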

In the case of RoBERTa for TransformerWordEmbeddings, we would simply add it to the list I mentioned, found in the flair/embeddings/token.py source file, like this:

class TransformerWordEmbeddings(TokenEmbeddings):

    NO_MAX_SEQ_LENGTH_MODELS = [XLNetModel, TransfoXLModel, RobertaModel]  # <-- RobertaModel added here

    def __init__(

The rest should be handled automatically thanks to PR #2191.

Now, in the case of TransformerDocumentEmbeddings, I think you will have to do a little more work, because you'd have to "port" that fix so it works with TransformerDocumentEmbeddings as well. I'm not aware of how it works internally, so unfortunately I can't help you with that part.

@stale

stale bot commented Sep 9, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
