TransformerWordEmbeddings using Spanish BERT doesn't work #2181
Hey @matirojasg , I can confirm this bug; there's something strange with the model: it returns a wrong value for the maximum sequence length. I will try to get in contact with the model author and report back here :)
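A minimal sketch of how the wrong value shows up (the checkpoint id "dccuchile/bert-base-spanish-wwm-uncased" is assumed here): without a "max_len" entry in tokenizer_config.json, the transformers tokenizer falls back to a huge sentinel value instead of 512.

```python
# Hedged sketch: inspect the maximum length reported by the tokenizer.
# The checkpoint id is an assumption; with "max_len" missing from
# tokenizer_config.json, model_max_length is a very large sentinel, not 512.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-uncased")
print(tokenizer.model_max_length)  # expected 512
```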
Thank you! :)
@stefan-it I talked to the author, since he is from my university; he is going to change the configuration file.
Hi @matirojasg , that would be awesome! I was searching for contact information of the BETO team 😅 So the easiest way would be to extend the tokenizer_config.json and add a "max_len": 512 option :)
Hi Stefan,
About this issue, I'm going to change the config files in a pull request.
```json
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 31002
}
```
https://users.dcc.uchile.cl/~jperez/beto/uncased_2M/config.json
This is the current config file; which key/value do I have to add?
Hi @matirojasg , you can basically just use this tokenizer_config.json file (you don't have to change anything in the config.json file): https://huggingface.co/dbmdz/bert-base-turkish-uncased/blob/main/tokenizer_config.json

This is also for an uncased model, and it additionally specifies the max length (and sets it to 512). Hope this helps :)
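For reference, a minimal sketch of what the extended tokenizer_config.json could look like (the "do_lower_case" flag is an assumption matching an uncased model; the "max_len" entry is what this issue is about):

```json
{
  "do_lower_case": true,
  "max_len": 512
}
```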
Thank you!
Hi. I just ran into the same issue using the 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext' model. The error went away when I manually set the tokenizer's maximum length to 512.
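A minimal sketch of that manual workaround (attribute names assumed from the flair and transformers APIs; the embeddings object exposes its Hugging Face tokenizer as .tokenizer):

```python
# Hedged sketch: override the tokenizer's maximum length by hand after
# constructing the embeddings, instead of relying on tokenizer_config.json.
from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
)
embeddings.tokenizer.model_max_length = 512  # cap at BERT's 512-token limit
```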
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello. I have found that I cannot use the TransformerWordEmbeddings class for the Spanish BERT model. This is the code:
This is the error:
What should I do?
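A minimal sketch of this kind of usage (the BETO checkpoint id "dccuchile/bert-base-spanish-wwm-uncased" is assumed here, not taken from the original report):

```python
# Hedged sketch: embed a Spanish sentence with TransformerWordEmbeddings.
# The checkpoint id is an assumption; before the tokenizer_config.json fix,
# embedding with such a model triggered the error reported above.
from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings("dccuchile/bert-base-spanish-wwm-uncased")
sentence = Sentence("Hola, ¿cómo estás?")
embeddings.embed(sentence)
for token in sentence:
    print(token.text, token.embedding.shape)
```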