
OverflowError: int too big to convert #2210

Closed
gunturbudi opened this issue Apr 15, 2021 · 7 comments · Fixed by #2502
Labels
wontfix This will not be worked on

Comments

@gunturbudi

gunturbudi commented Apr 15, 2021

Hello,
I'm trying to train a named entity recognition model with this embedding: TransformerWordEmbeddings('emilyalsentzer/Bio_ClinicalBERT'). However, it always fails with OverflowError: int too big to convert. The same thing happens with some other transformer word embeddings, such as XLNet, while BERT and RoBERTa work fine.
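
For reference, the training setup looks roughly like this (the corpus loading and paths below are simplified placeholders; the embedding and trainer parameters match the run shown in the log):

from flair.datasets import ColumnCorpus
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# placeholder corpus: the real data is a custom clinical NER corpus in column format
corpus = ColumnCorpus('data/', {0: 'text', 1: 'ner'})
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')

# the embedding that triggers the error
embeddings = TransformerWordEmbeddings('emilyalsentzer/Bio_ClinicalBERT')

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
)

trainer = ModelTrainer(tagger, corpus)
trainer.train(
    'xxx-clinical-bert',           # placeholder output path
    learning_rate=0.1,
    mini_batch_size=32,
    max_epochs=200,
    embeddings_storage_mode='gpu',
)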

Here is the full traceback of the error:

2021-04-15 09:34:48,106 ----------------------------------------------------------------------------------------------------
2021-04-15 09:34:48,106 Corpus: "Corpus: 778 train + 259 dev + 260 test sentences"
2021-04-15 09:34:48,106 ----------------------------------------------------------------------------------------------------
2021-04-15 09:34:48,106 Parameters:
2021-04-15 09:34:48,106  - learning_rate: "0.1"
2021-04-15 09:34:48,106  - mini_batch_size: "32"
2021-04-15 09:34:48,106  - patience: "3"
2021-04-15 09:34:48,106  - anneal_factor: "0.5"
2021-04-15 09:34:48,106  - max_epochs: "200"
2021-04-15 09:34:48,106  - shuffle: "True"
2021-04-15 09:34:48,106  - train_with_dev: "False"
2021-04-15 09:34:48,106  - batch_growth_annealing: "False"
2021-04-15 09:34:48,107 ----------------------------------------------------------------------------------------------------
2021-04-15 09:34:48,107 Model training base path: "/home/xxx/data/xxx-clinical-bert"
2021-04-15 09:34:48,107 ----------------------------------------------------------------------------------------------------
2021-04-15 09:34:48,107 Device: cuda:0
2021-04-15 09:34:48,107 ----------------------------------------------------------------------------------------------------
2021-04-15 09:34:48,107 Embeddings storage mode: gpu
2021-04-15 09:34:48,116 ----------------------------------------------------------------------------------------------------
Traceback (most recent call last):
  File "train_medical_2.py", line 144, in <module>
    train_ner(d + '-base-ent',corpus_base)
  File "train_medical_2.py", line 136, in train_ner
    max_epochs=200)
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/flair/trainers/trainer.py", line 381, in train
    loss = self.model.forward_loss(batch_step)
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/flair/models/sequence_tagger_model.py", line 637, in forward_loss
    features = self.forward(data_points)
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/flair/models/sequence_tagger_model.py", line 642, in forward
    self.embeddings.embed(sentences)
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/flair/embeddings/token.py", line 81, in embed
    embedding.embed(sentences)
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/flair/embeddings/base.py", line 60, in embed
    self._add_embeddings_internal(sentences)
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/flair/embeddings/token.py", line 923, in _add_embeddings_internal
    self._add_embeddings_to_sentence(sentence)
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/flair/embeddings/token.py", line 999, in _add_embeddings_to_sentence
    truncation=True,
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/transformers/tokenization_utils_base.py", line 2438, in encode_plus
    **kwargs,
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/transformers/tokenization_utils_fast.py", line 472, in _encode_plus
    **kwargs,
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/transformers/tokenization_utils_fast.py", line 379, in _batch_encode_plus
    pad_to_multiple_of=pad_to_multiple_of,
  File "/home/d111199102201607101/flair/lib/python3.6/site-packages/transformers/tokenization_utils_fast.py", line 330, in set_truncation_and_padding
    self._tokenizer.enable_truncation(max_length, stride=stride, strategy=truncation_strategy.value)
OverflowError: int too big to convert

I have tried changing the embedding_storage_mode, hidden_size, and mini_batch_size, but none of these fixed the issue.

Does anyone have the same issue?
Is there any way to resolve this?

Thanks

@schelv
Contributor

schelv commented Apr 16, 2021

I think I've seen something like this before.
In my case it happened because the max_length property (or something with a similar name) was not set correctly on the transformers model/tokenizer, so it fell back to a very large number, which can produce exactly the error you are getting.

You could check if this property is set correctly on the tokenizer and/or the model objects that are used inside the TransformerWordEmbeddings. According to this it should be 128.
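
A quick way to check that, assuming a reasonably recent transformers version where the attribute is called model_max_length:

from transformers import AutoTokenizer

# load the same tokenizer Flair builds internally and inspect its maximum length
tokenizer = AutoTokenizer.from_pretrained('emilyalsentzer/Bio_ClinicalBERT')
print(tokenizer.model_max_length)

If this prints a huge sentinel value (on the order of 1e30) instead of something like 128 or 512, then no maximum length is configured, and passing that number to the Rust tokenizer's enable_truncation is exactly what raises the OverflowError in your traceback. On the Flair side the tokenizer should also be reachable as embedding.tokenizer on the TransformerWordEmbeddings object, if I remember correctly.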

@alanakbik
Collaborator

Hi, I believe this was fixed in PR #2191 by @tiagokv, at least for XLNet. Could you check if you also see this error on the master branch?
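
If you don't have a source checkout at hand, installing the current master branch directly via pip should also work, e.g. pip install git+https://github.com/flairNLP/flair.git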

@muellerzr

@alanakbik can confirm it fixed one case for me where I was getting this issue!

@esmarruffo

Hello!

It might be too late, but just in case anyone has the same issue: you just need to add the model class to the NO_MAX_SEQ_LENGTH_MODELS list found at the beginning of the TransformerWordEmbeddings class declaration.

Here:

class TransformerWordEmbeddings(TokenEmbeddings):

    NO_MAX_SEQ_LENGTH_MODELS = [XLNetModel, TransfoXLModel, BertModel]  # <-- BertModel added here

    def __init__(

As you can see, I've added the BERT model class, since it was causing the same error for me. Remember to import it from the Hugging Face 🤗 transformers library first.
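
For example, at the top of flair/embeddings/token.py (assuming BertModel is not already imported there):

from transformers import BertModel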

It looks like the fix from @tiagokv's PR is needed for the other transformer models besides XLNet as well, so that TransformerWordEmbeddings always works properly.

@seyyaw
Contributor

seyyaw commented May 11, 2021

(quotes the workaround from the previous comment)

@S-glitch I have a similar issue when using TransformerDocumentEmbeddings with a RoBERTa model (a custom one). How did you solve the issue for TransformerWordEmbeddings? Did you modify the source? Thanks

@esmarruffo

esmarruffo commented May 11, 2021

Hi @seyyaw,

Yes, you need to have a modifiable installation of the Flair framework and edit the source. I think the easiest way would be to just clone the repository and use it directly.
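
Something along these lines should do it (a standard editable install from a clone; adjust paths as needed):

git clone https://github.com/flairNLP/flair.git
cd flair
pip install -e .

After that, any edits you make under flair/ take effect directly.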

In the case of RoBERTa for TransformerWordEmbeddings, we would simply add it to the list I mentioned, found in the flair/embeddings/token.py source file, like this:

class TransformerWordEmbeddings(TokenEmbeddings):

    NO_MAX_SEQ_LENGTH_MODELS = [XLNetModel, TransfoXLModel, RobertaModel]  # <-- RobertaModel added here

    def __init__(

The rest should be handled automatically thanks to PR #2191.

Now, in the case of TransformerDocumentEmbeddings, I think you will have to do a little more work, because you'd have to "port" that fix so it works with TransformerDocumentEmbeddings as well. I'm not aware of how it works internally, so unfortunately I can't help you with that part.

@stale

stale bot commented Sep 9, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
