This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Info about the German model #26
Comments
Hi @datistiquo , we use the News 2018 Crawl from the WMT mirror. More precisely, the corpus can be found here: http://data.statmt.org/news-crawl/de/

Technically, there's no difference between the two models you mentioned; we just copied the model under a different name.

There's an upcoming paper about newer and larger German BERT and ELECTRA models, available as a pre-print: https://arxiv.org/abs/2010.10906 It is a collaboration with the deepset folks, and the paper was accepted at COLING 2020 🤗
Thank you! Yes, I will try this new model. I already tried the base model, but sadly it performs much worse for my use case than the original deepset model. Or maybe I missed something that needs to be handled differently between the two models? I use the new model the same way as the old model, although a different usage code example is given on Hugging Face. So I don't know whether the model behaves correctly when used as follows:
https://huggingface.co/deepset/gbert-base

By the way: did you notice any difference between using BERT models with PyTorch weights and with TF weights? Your own DBMDZ models are only available in PyTorch, and using them in TF also works very badly for my use case. But that could of course be due to the model itself!
I would say this depends on the downstream task (comparison between deepset cased and DBMDZ cased) and the hyperparameters used for fine-tuning. Your example should totally work, because we uploaded the additional files needed for that usage. I'm not aware of any huge differences between TF weights and PyTorch weights in Transformers.
Ah, are the hyperparameters for the LM pre-training documented anywhere? That is why I opened this issue. 🤗 Is there a paper? Do you also have any model fine-tuned from the DBMDZ models on a downstream task (the parameters for this would be interesting too)? 🤗
A few hyper-parameters for text classification and NER are mentioned in the COLING paper, see Table 3. I'm not sure what task and parameters you are using, but you should definitely vary learning rate, batch size and number of epochs (for each model independently) when you want to compare e.g. the deepset and DBMDZ models for your task :)
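A minimal sketch of such an independent sweep per model (the value ranges here are hypothetical placeholders, not the ones from the paper):

```python
from itertools import product

# Hypothetical search grid; pick ranges to match the COLING paper
# or your own compute budget.
learning_rates = [2e-5, 3e-5, 5e-5]
batch_sizes = [16, 32]
num_epochs = [3, 4]

# Enumerate every combination once per model, so each model
# (e.g. deepset vs. DBMDZ) gets its own independent search.
configs = [
    {"lr": lr, "batch_size": bs, "epochs": ep}
    for lr, bs, ep in product(learning_rates, batch_sizes, num_epochs)
]
print(len(configs))  # 12 combinations per model
```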
Ah ok, it's a good idea to vary them independently. My task is a semantic sentence-pair similarity task.
I compared my fine-tuning results with both cased and uncased data, and found that if I use cased training data with your uncased model, I get better results than with uncased data. Can you explain this? Your uncased model was trained on lowercased text, right? So I assume I have to fine-tune it on lowercased text as well. How does the model then actually handle cased text? There should be no vocab entry for e.g. "Apfel"?
Hi @datistiquo , at the moment I'm training cased models only, because I haven't seen any task where uncasing really helps. This could be due to the fact that e.g. the official BERT implementation also performs accent stripping for uncased models, so umlauts get stripped as well. Just try to use cased models (and yes, it would be possible to disable accent stripping and train a model with only lowercased corpora, but my main focus is currently on NER and cased models 😅).
A snippet to test the tokenizer behavior:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-dbmdz-uncased")
```

then:

```python
In [3]: tokenizer.tokenize("Apfel")
Out[3]: ['apfel']

In [4]: tokenizer.tokenize("Äpfel")
Out[4]: ['apfel']
```
So the tokenizer of the model handles cased and uncased input the same way? So cased words get lowercased? Does it then really not matter whether you use cased text for training? Or is there some caveat?
It is somewhat confusing. I don't want to get confused, and I hope I understand correctly what an uncased model really is: it just uses uncased (lowercased) words for training? :) 🤗
You mean that you use the Hugging Face tokenizer throughout? I think this is what determines the uncasing and casing?
Why that? You mean a cased model for lowercased data? This was the reason I asked, because I found that using cased data with your uncased model works better... But maybe that was just at this moment... Also, I hope by training you also mean fine-tuning? My task is only about fine-tuning a pretrained model.^^ (from me)
(from you)
Although you just released a new uncased model, what do you mean by that? ;-) Your new Europeana models, are they mainly suited to historic text? Could you please train an uncased model only on current news or something like that? :)
Hi @datistiquo ,
Exactly, and the tokenizer is the main responsible part. E.g. it will lowercase your input (both during pre-training and fine-tuning), so there's no need to lowercase your training corpus manually when pre-training a model from scratch, because the tokenizer will do that for you.
When you pre-train a model from scratch, take BERT or ELECTRA for example, this is performed by their tokenizer implementation. When fine-tuning a model with HF Transformers, the HF tokenizer uses "pretty much" the same tokenization logic (bugs can appear, but let's assume that this is not the case). The HF tokenizer will then also perform the lowercasing and accent stripping on your data!
But this cased data will automatically be lowercased when using an uncased model (the HF tokenizer will perform it).
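For intuition, the lowercasing and accent-stripping step can be reproduced with plain `unicodedata`. This is a simplified sketch of what an uncased BERT tokenizer's basic normalization does, not the actual Transformers implementation:

```python
import unicodedata

def lowercase_and_strip_accents(text: str) -> str:
    """Simplified version of the normalization an uncased BERT
    tokenizer applies before WordPiece splitting."""
    text = text.lower()
    # NFD decomposition separates base characters from combining accents,
    # which are then dropped (category 'Mn' = nonspacing mark).
    text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

print(lowercase_and_strip_accents("Äpfel"))  # apfel
```

This is why both "Apfel" and "Äpfel" end up as the same token in the snippet above.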
Let's say fine-tuning "!=" pre-training. Because you could also use the embeddings from the model as "features" and train a model without fine-tuning. This "feature-based" approach is e.g. used here.
You're right, but that was quite some time ago.
Yes, the Europeana models (both German and French) are trained on historic text (and sometimes also on very noisy data, because of OCR). I currently have no plans for training uncased models, but e.g. here's an uncased model trained on current news: https://huggingface.co/german-nlp-group/electra-base-german-uncased (And they tried to pre-train an uncased model without the accent-stripping logic, but this requires some modification in the pre-training part, see google-research/electra#88. Later, in the HF tokenizer, you can simply disable it in the tokenizer configuration.)
Thank you so much. The information is spread across various places. How can I stay up to date on possibly new German models? Besides the uncased model mentioned above, are you aware of other German (uncased) models? Especially ones trained on relatively new news data? The models above use news articles from 2018, I think.
I assume that for many models out there, the creators (like you) also have to implement the tokenizer for Hugging Face? How and where can I find out what the tokenizer is doing under the hood when fine-tuning any model via Hugging Face? For example, what is the tokenizer doing for the DBMDZ cased model? :)
I found that a comma somehow alters the word? Is it intended that sentence punctuation contributes to the context of the words? Before looking into it myself: how are punctuation marks (., ?, !, etc.) handled by the tokenizer and during training? I found different embeddings (and results) for e.g.:
So I just appended a "," directly after the word... without the comma, both give the same result...
Hi @datistiquo , the best thing is to watch out for new models on the Hugging Face model hub. Using the language filter for German: https://huggingface.co/models?filter=de you can immediately find all available models.

The tokenizer part for BERT lives in the Transformers library; this implementation is then used when you fine-tune a model. On the model hub, you can also find each model's tokenizer configuration, e.g.:

https://huggingface.co/german-nlp-group/electra-base-german-uncased/tree/main

In this example I'll use the tokenizer configuration:

```json
{"strip_accents": false, "special_tokens_map_file": null, "full_tokenizer_file": null, "max_len": 512}
```

As you can see, strip_accents is set to false, so accents are kept:

```python
In [1]: from transformers import AutoTokenizer

In [2]: tokenizer = AutoTokenizer.from_pretrained("german-nlp-group/electra-base-german-uncased")

In [3]: tokenizer.tokenize("Äpfel")
Out[3]: ['ä', '##pfel']
```

The input is still lowercased, as the output shows.
The ELECTRA tokenizer is based on the BERT tokenizer, and the BERT tokenizer lowercases input by default. So if you have an ELECTRA model that is cased, you need to specify that in the tokenizer configuration, e.g.: https://huggingface.co/deepset/gelectra-large/blob/main/tokenizer_config.json
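A sketch of what such a cased tokenizer configuration could contain (illustrative values only; the field names follow the Hugging Face tokenizer configuration format, and the linked file has the actual contents):

```json
{"do_lower_case": false, "strip_accents": false}
```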
Good question: it uses the BERT tokenizer implementation (as mentioned above) and also has a tokenizer configuration that disables lowercasing: https://huggingface.co/dbmdz/bert-base-german-cased/blob/main/tokenizer_config.json
Could you provide a more detailed example for the punctuation problem? 🤔 I tried:

```python
In [6]: tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")

In [7]: tokenizer.tokenize("Obst,")
Out[7]: ['obst', ',']
```

so far.
So punctuation marks get their own embeddings? I don't want a single comma to influence my fine-tuning results. Maybe I have to remove such things beforehand then?
Exactly, so I highly recommend that you perform some kind of preprocessing (including punctuation removal, or even removing stop words).
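A small preprocessing sketch along those lines (a simple regex-based approach; a real pipeline may want smarter handling of abbreviations, numbers, or sentence boundaries):

```python
import re

def strip_punctuation(text: str) -> str:
    """Remove common sentence punctuation and collapse the
    resulting extra whitespace."""
    text = re.sub(r"[.,!?;:]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(strip_punctuation("Obst, Gemüse und Äpfel!"))  # Obst Gemüse und Äpfel
```

Applying this before tokenization keeps a stray comma from changing the token sequence that the model sees.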
Hey,
is this here the only info about the German BERT model? Or is there any paper about it?
I am interested in the News Crawl. Do you know what kind of "news" it contains (I think you used a language classifier to get German news only)? Is it really an archive of online news sources like spiegel.de/WELT/n-tv...? Getting such German news article texts is currently also one of my goals.
Thank you!
PS:
What is the difference between these models?
dbmdz/bert-base-german-uncased and
bert-base-german-dbmdz-uncased