TransformerWordEmbeddings using Spanish BERT doesn't work #2181

Closed
matirojasg opened this issue Mar 23, 2021 · 9 comments
Labels: bug (Something isn't working), wontfix (This will not be worked on)

Comments

@matirojasg

Hello. I have found that I cannot use the TransformerWordEmbeddings class for the Spanish BERT model.

This is the code:

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

sentence = Sentence('El pasto es verde.')

# load the Spanish BERT (BETO) model
embeddings = TransformerWordEmbeddings('dccuchile/bert-base-spanish-wwm-uncased')
embeddings.embed(sentence)
print(sentence[0].embedding.size())

This is the error:

    324         # Set truncation and padding on the backend tokenizer
    325         if truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE:
--> 326             self._tokenizer.enable_truncation(max_length, stride=stride, strategy=truncation_strategy.value)
    327         else:
    328             self._tokenizer.no_truncation()

OverflowError: int too big to convert

What should I do?

matirojasg added the bug label on Mar 23, 2021
@stefan-it
Member

stefan-it commented Mar 23, 2021

Hey @matirojasg,

I can confirm this bug; there's something strange with the model:

In [5]: tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")

In [6]: tokenizer.model_max_length
Out[6]: 1000000000000000019884624838656

In [7]: tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

In [8]: tokenizer.model_max_length
Out[8]: 512

So it returns a wrong value for model_max_length; for another model like BERTurk, it returns the correct value.

I will try to get in contact with the model author and report back here :)
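
A possible user-side workaround in the meantime (a sketch, not an official fix; model_max_length is a standard tokenizer argument in recent transformers releases) is to cap the value explicitly when loading the tokenizer:

from transformers import AutoTokenizer

# Override the bogus default with an explicit 512-token cap
tokenizer = AutoTokenizer.from_pretrained(
    "dccuchile/bert-base-spanish-wwm-uncased",
    model_max_length=512,
)
print(tokenizer.model_max_length)  # expected: 512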

@matirojasg
Author

Thank you! :)

@matirojasg
Author

@stefan-it I talked to the author, since he is from my university, and he is going to change the configuration file.

@stefan-it
Member

Hi @matirojasg, that would be awesome! I was searching for contact information for the BETO team 😅

So the easiest way would be to extend the tokenizer_config.json and add a "max_len": 512 option :)
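
A minimal sketch of that change, assuming a local clone of the model repository (the file path is an assumption, and the do_lower_case entry is only a guess for an uncased model):

import json
import os

# Hypothetical: add a max_len entry to a locally cloned model's tokenizer_config.json
path = "bert-base-spanish-wwm-uncased/tokenizer_config.json"  # assumed local path

config = {}
if os.path.exists(path):
    with open(path) as f:
        config = json.load(f)

config["max_len"] = 512                   # cap the tokenizer's maximum sequence length
config.setdefault("do_lower_case", True)  # assumed, since the model is uncased

with open(path, "w") as f:
    json.dump(config, f, indent=2)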

@matirojasg
Author

matirojasg commented Mar 25, 2021 via email

@stefan-it
Member

stefan-it commented Mar 25, 2021

Hi @matirojasg,

you can basically just use this tokenizer_config.json file (you don't have to change anything in the config.json file):

https://huggingface.co/dbmdz/bert-base-turkish-uncased/blob/main/tokenizer_config.json

This is also for an uncased model, and it additionally specifies the maximum length (setting it to 512).

Hope this helps :)

@matirojasg
Author

matirojasg commented Mar 25, 2021 via email

@sokol11

sokol11 commented May 24, 2021

Hi. I just ran into the same issue using the 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext' model. The error went away when I manually set tokenizer.model_max_length = 512. It was set to 1000000000000000019884624838656 by default.
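
The same workaround as a short sketch, using the model from this comment (the override happens after loading and before any tokenization; the sample text is just a placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
)
tokenizer.model_max_length = 512  # override the bogus default before tokenizing

# Truncation no longer triggers the OverflowError
encoded = tokenizer("a long biomedical abstract ...", truncation=True)
print(len(encoded["input_ids"]))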

@stale

stale bot commented Sep 21, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Sep 21, 2021
stale bot closed this as completed on Sep 28, 2021