TransformerWordEmbeddings using Spanish BERT doesn't work #2181

Closed
matirojasg opened this issue Mar 23, 2021 · 9 comments
Labels: bug (Something isn't working), wontfix (This will not be worked on)

Comments

@matirojasg

Hello. I have found that I cannot use the TransformerWordEmbeddings class for the Spanish BERT model.

This is the code:

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

sentence = Sentence('El pasto es verde.')

# load the Spanish BERT (BETO) model
embeddings = TransformerWordEmbeddings('dccuchile/bert-base-spanish-wwm-uncased')
embeddings.embed(sentence)
print(sentence[0].embedding.size())

This is the error:

    324         # Set truncation and padding on the backend tokenizer
    325         if truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE:
--> 326             self._tokenizer.enable_truncation(max_length, stride=stride, strategy=truncation_strategy.value)
    327         else:
    328             self._tokenizer.no_truncation()

OverflowError: int too big to convert

What should I do?

matirojasg added the bug label on Mar 23, 2021
@stefan-it
Member

stefan-it commented Mar 23, 2021

Hey @matirojasg,

I can confirm this bug; there's something strange with the model:

In [5]: tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")

In [6]: tokenizer.model_max_length
Out[6]: 1000000000000000019884624838656

In [7]: tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

In [8]: tokenizer.model_max_length
Out[8]: 512

So it returns a wrong value for model_max_length; for another model like BERTurk, it returns the correct value.

I will try to get in contact with the model author and report back here :)
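
A possible user-side workaround in the meantime (a sketch, not an official fix; model_max_length is a standard tokenizer argument in recent transformers releases) is to cap the value explicitly when loading the tokenizer:

from transformers import AutoTokenizer

# Override the bogus default with an explicit 512-token cap
tokenizer = AutoTokenizer.from_pretrained(
    "dccuchile/bert-base-spanish-wwm-uncased",
    model_max_length=512,
)
print(tokenizer.model_max_length)  # expected: 512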

@matirojasg
Author

Thank you! :)

@matirojasg
Author

@stefan-it I talked to the author, since he is from my university, and he is going to change the configuration file.

@stefan-it
Member

Hi @matirojasg, that would be awesome! I was searching for contact information for the BETO team 😅

So the easiest way would be to extend the tokenizer_config.json and add a "max_len": 512 option :)
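
A minimal sketch of that change, assuming a local clone of the model repository (the file path is an assumption, and the do_lower_case entry is only a guess for an uncased model):

import json
import os

# Hypothetical: add a max_len entry to a locally cloned model's tokenizer_config.json
path = "bert-base-spanish-wwm-uncased/tokenizer_config.json"  # assumed local path

config = {}
if os.path.exists(path):
    with open(path) as f:
        config = json.load(f)

config["max_len"] = 512                   # cap the tokenizer's maximum sequence length
config.setdefault("do_lower_case", True)  # assumed, since the model is uncased

with open(path, "w") as f:
    json.dump(config, f, indent=2)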

@matirojasg
Author

matirojasg commented Mar 25, 2021 via email

@stefan-it
Member

stefan-it commented Mar 25, 2021

Hi @matirojasg,

you can basically just use this tokenizer_config.json file (you don't have to change anything in the config.json file):

https://huggingface.co/dbmdz/bert-base-turkish-uncased/blob/main/tokenizer_config.json

This is also for an uncased model, and it additionally specifies the maximum length (setting it to 512).

Hope this helps :)

@matirojasg
Author

matirojasg commented Mar 25, 2021 via email

@sokol11

sokol11 commented May 24, 2021

Hi. I just ran into the same issue using the 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext' model. The error went away when I manually set tokenizer.model_max_length = 512. It was set to 1000000000000000019884624838656 by default.
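
The same workaround as a short sketch, using the model from this comment (the override happens after loading and before any tokenization; the sample text is just a placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
)
tokenizer.model_max_length = 512  # override the bogus default before tokenizing

# Truncation no longer triggers the OverflowError
encoded = tokenizer("a long biomedical abstract ...", truncation=True)
print(len(encoded["input_ids"]))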

@stale

stale bot commented Sep 21, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Sep 21, 2021
stale bot closed this as completed on Sep 28, 2021