Infos about the german model #26

Closed

datistiquo opened this issue Nov 11, 2020 · 19 comments

datistiquo commented Nov 11, 2020

Hey,

Is this

The source data for the model consists of a recent Wikipedia dump, EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl and News Crawl. This results in a dataset with a size of 16GB and 2,350,234,427 tokens.

For sentence splitting, we use spacy. Our preprocessing steps (sentence piece model for vocab generation) follow those used for training SciBERT. The model is trained with an initial sequence length of 512 subwords and was performed for 1.5M steps.

the only information available about the German BERT model? Or is there a paper about it?

I am interested in the News Crawl. Do you know what kind of "news" it contains (I think you used a language classifier to get German news only)? Is it really an archive of online news sources like spiegel.de/WELT/n-tv...? Getting hold of such German news article text is currently also my objective.

Thank you!

PS:

What is the difference between these models?

dbmdz/bert-base-german-uncased and

bert-base-german-dbmdz-uncased

stefan-it (Collaborator) commented Nov 11, 2020

Hi @datistiquo,

we use the 2018 News Crawl from the WMT mirror. More precisely, the corpus can be found here:

http://data.statmt.org/news-crawl/de/

From there, download news.2018.de.shuffled.deduped.gz. The resulting (uncompressed) file is 4.1GB of text.
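
If it helps, here's a minimal sketch for fetching and inspecting that file with plain Python (the URL is just the directory above plus the filename; counting lines over a ~4GB file takes a while):

import gzip
import urllib.request

# Download the 2018 German News Crawl (URL assembled from the directory and filename above).
url = "http://data.statmt.org/news-crawl/de/news.2018.de.shuffled.deduped.gz"
urllib.request.urlretrieve(url, "news.2018.de.shuffled.deduped.gz")

# The file contains one shuffled, deduplicated sentence per line; count them.
with gzip.open("news.2018.de.shuffled.deduped.gz", "rt", encoding="utf-8") as f:
    print(sum(1 for _ in f))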

Technically, there's no difference between the two models you mentioned; we just copied bert-base-german-dbmdz-uncased into our dbmdz namespace.

There's an upcoming paper about newer and larger German BERT and ELECTRA models, available as a pre-print:

https://arxiv.org/abs/2010.10906

It is a collaboration with the deepset folks, and the paper was accepted at COLING 2020 🤗

datistiquo (Author) commented Nov 12, 2020

Thank you! Yes, I will try this new model. I already tried the base model, but sadly it performs much worse for my use case than the original deepset model. Or maybe I missed something that needs to be handled differently between the two models? The large model is more difficult to use because of memory issues! :(

I use the new model in the same way as the old model, although a different usage code example is given on Hugging Face. So I don't know if this model works correctly when loaded with the following usage:

from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("deepset/gbert-base")
bert_model = TFBertModel.from_pretrained("deepset/gbert-base")

https://huggingface.co/deepset/gbert-base

By the way: did you notice any difference between using BERT models from PyTorch vs. TF weights? Your own dbmdz models are only available in PyTorch, and using them in TF also performs very badly for my case. But that could of course be due to the model itself!

stefan-it (Collaborator) commented:

I would say this depends on the downstream task (comparison between deepset cased and DBMDZ cased) and the hyperparameters used for fine-tuning.

Your example should work fine, because we uploaded an additional tf_model.h5 to the model hub; it is a PyTorch-to-TensorFlow converted model that is ready to use with TFBertModel 🤗

I'm not aware of any huge differences between TF weights and PyTorch weights in Transformers.
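
And for checkpoints that only ship PyTorch weights, Transformers can convert them on the fly; a minimal sketch (from_pt is a standard from_pretrained argument):

from transformers import BertTokenizer, TFBertModel

# Load a PyTorch-only checkpoint into TensorFlow; from_pt=True converts the
# PyTorch weights on the fly instead of looking for a tf_model.h5.
tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
model = TFBertModel.from_pretrained("dbmdz/bert-base-german-uncased", from_pt=True)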

datistiquo (Author) commented:

I would say this depends on the downstream task (comparison between deepset cased and DBMDZ cased) and the hyperparameters used for fine-tuning.

Ah, are the hyperparameters for the LM documented anywhere? That is why I opened this issue. 🤗 Is there a paper? Do you also have any model fine-tuned from the dbmdz models on a downstream task (the parameters for that would be interesting too)? 🤗

stefan-it (Collaborator) commented:

A few hyper-parameters for text classification and NER are mentioned in the COLING paper, see Table 3.

I'm not sure what task and parameters you are using, but you should definitely vary learning rate, batch size and number of epochs (for each model independently) when you want to compare e.g. the deepset and DBMDZ models for your task :)
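
A rough sketch of such a sweep (the value ranges here are purely illustrative, not the ones from the paper):

# Illustrative grid only; pick ranges that fit your task and hardware.
models = ["deepset/gbert-base", "dbmdz/bert-base-german-uncased"]
learning_rates = [1e-5, 2e-5, 3e-5]
batch_sizes = [16, 32]
epoch_counts = [2, 3, 4]

for model_name in models:
    for lr in learning_rates:
        for bs in batch_sizes:
            for n_epochs in epoch_counts:
                # fine-tune model_name with this configuration and record the dev score
                ...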

datistiquo (Author) commented:

Ah ok, varying them independently sounds like a good idea. My task is a semantic sentence pair similarity task.

datistiquo (Author) commented:

@stefan-it

I compared my fine-tuning results with both cased and uncased data, and found that if I use cased training data with your uncased model I get better results than with uncased data. Can you explain this? Your uncased model was trained on lowercased text, right? So I assume that I have to fine-tune it on lowercased text as well. How does the model then actually handle cased text? There should be no vocab entry for e.g. "Apfel"?

stefan-it (Collaborator) commented:

Hi @datistiquo,

at the moment, I'm training cased models only, because I haven't seen any task where uncasing really helps. This could be due to the fact that e.g. the official BERT implementation also performs accent stripping, so ä is transformed to a. In your example, the tokenizer will transform Apfel to apfel.

Just try to use cased models (and yes, it would be possible to disable accent stripping and train a model with only lowercased corpora, but my main focus is currently on NER and cased models 😅).

stefan-it (Collaborator) commented:

A snippet to test the tokenizer behavior:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-german-dbmdz-uncased")

then:

In [3]: tokenizer.tokenize("Apfel")
Out[3]: ['apfel']

In [4]: tokenizer.tokenize("Äpfel")
Out[4]: ['apfel']

datistiquo (Author) commented:

So the model's tokenizer handles cased and uncased input the same way? So cased words get lowercased?

Does it then really not matter if you use cased text for training? Or is there some caveat?

datistiquo (Author) commented Feb 10, 2021

It is somewhat confusing.

I do not want to get confused, and I hope I understand correctly what an uncased model really is: it just uses uncased (lowercased) words for training? :) 🤗

We use the awesome 🤗 / Tokenizers library for building the BERT-compatible vocab (32,000 subwords).

You mean that you use the Hugging Face tokenizer entirely? I think that is what determines the uncasing and casing?

Just try to use cased models (and yes, it would be possible to disable accent stripping and train a model with only lowercased corpora, but my main focus is currently on NER and cased models 😅).

Why is that? Do you mean a cased model for lowercased data? This was the reason I asked, because I found that using cased data with your uncased model works better... But maybe that was just a one-off...

Also, I hope that by training you also mean fine-tuning? My task is just about fine-tuning a pretrained model. ^^

(from me)

I compared my fine-tuning results with both cased and uncased data, and found that if I use cased training data with your uncased model I get better results than with uncased data. Can you explain this? Your uncased model was trained on lowercased text, right? So I assume that I have to fine-tune it on lowercased text as well.

(from you)

at the moment, I'm training cased models only, because I haven't seen any task where uncasing really helps.

Although you just released a new uncased model, what do you mean by that? ;-)

Your new Europeana models, are they mostly suitable for historic text? Could you please train an uncased model on only current news or something like that? :)

stefan-it (Collaborator) commented:

Hi @datistiquo,

I do not want to get confused, and I hope I understand correctly what an uncased model really is: it just uses uncased (lowercased) words for training? :)

Exactly, and the tokenizer is the main responsible component. E.g. it will lowercase your input (both during pre-training and fine-tuning), so there's no need to lowercase your training corpus manually when pre-training a model from scratch; your tokenizer will do that for you.

You mean that you use the Hugging Face tokenizer entirely? I think that is what determines the uncasing and casing?

When you pre-train a model from scratch, take BERT or ELECTRA for example, then this is performed by their tokenizer implementation. When fine-tuning a model with HF Transformers, the HF tokenizer uses "pretty much" the same tokenization logic (bugs can appear, but let's assume that is not the case). The HF tokenizer will then also perform lowercasing and accent stripping for you and your data!

This was the reason I asked, because I found that using cased data with your uncased model works better... But maybe that was just a one-off...

But this cased data will automatically be lowercased when using an uncased model (the HF tokenizer performs it).
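
You can check this directly with the tokenizer (a quick sketch; both calls should produce identical tokens, since the uncased tokenizer lowercases and accent-strips internally):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")

# Cased and manually lowercased input end up as the same token sequence.
print(tokenizer.tokenize("Der Apfel ist rot."))
print(tokenizer.tokenize("der apfel ist rot."))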

Also, I hope that by training you also mean fine-tuning? My task is just about fine-tuning a pretrained model. ^^

Let's say "!=" pre-training. Because you could use the embeddings from the model as "features", and train a model without fine-tuning. This "feature-based" approach is e.g. used here.

Although you just released a new uncased model, what do you mean by that? ;-)

You're right, but that was quite a while ago.

Your new Europeana models, are they mostly suitable for historic text? Could you please train an uncased model on only current news or something like that? :)

Yes, the Europeana models (both for German and French) are trained on historic text (and sometimes also very noisy data, because of OCR). I currently have no plans for training uncased models, but e.g. here's an uncased model trained on current news:

https://huggingface.co/german-nlp-group/electra-base-german-uncased

(And they tried to pre-train an uncased model without the accent-stripping logic, but this requires some modification on the pre-training side, see google-research/electra#88. Later, in the HF tokenizer, you can simply disable that in the tokenizer configuration.)
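
For example, strip_accents can also be overridden when loading a tokenizer (a sketch; whether the resulting tokens are useful depends on the vocab the model was actually trained with):

from transformers import BertTokenizerFast

# strip_accents and do_lower_case are regular BERT tokenizer arguments and
# can be overridden at load time.
tokenizer = BertTokenizerFast.from_pretrained(
    "dbmdz/bert-base-german-uncased",
    strip_accents=False,
)
# The normalizer no longer strips the umlaut; the resulting subwords depend on the vocab.
print(tokenizer.tokenize("Äpfel"))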

datistiquo (Author) commented:

Thank you so much. There is a lot of information in various places. How can I stay up to date on possibly new German models? Besides the uncased model mentioned above, are you aware of other German (uncased) models? Especially ones with relatively recent news data? The models above use news articles from 2018, I think.

The HF tokenizer will then also perform lowercasing and accent stripping for you and your data!

I assume that for the many models out there, the creators of such models (like you) also have to provide the tokenizer for Hugging Face? How and where can I look to find out what the tokenizer is doing under the hood when fine-tuning any model via Hugging Face?

For example, what is the tokenizer doing for the dbmdz cased model? :)

datistiquo (Author) commented Feb 11, 2021

@stefan-it

I found that a comma somehow alters the word? Is it intended that sentence punctuation contributes to the context of the words? Before looking into it myself: how is punctuation (.,?!, etc.) handled by the tokenizer and during training?

I found different embeddings (and results) for e.g.:

"obst,"
"Obst"

So I just appended a "," directly after the word... without the comma, both are the same...

stefan-it (Collaborator) commented:

Hi @datistiquo,

so the best thing is to watch out for new models on the Hugging Face model hub. Using the language filter for German:

https://huggingface.co/models?filter=de

You can immediately find all available models.

The tokenizer part for BERT can be found here - this implementation is then used when you fine-tune a new model:

https://github.com/huggingface/transformers/blob/master/src/transformers/models/bert/tokenization_bert.py

On the model hub, you can find a file named tokenizer_config.json that includes all necessary configuration options for a tokenizer:

https://huggingface.co/german-nlp-group/electra-base-german-uncased/tree/main

In this example I'll use german-nlp-group/electra-base-german-uncased for further explanation. Their tokenizer configuration is:

{"strip_accents": false, "special_tokens_map_file": null, "full_tokenizer_file": null, "max_len": 512}

As you can see, strip_accents is set to False. So e.g. the word Äpfel will be tokenized like this:

In [1]: from transformers import AutoTokenizer

In [2]: tokenizer = AutoTokenizer.from_pretrained("german-nlp-group/electra-base-german-uncased")

In [3]: tokenizer.tokenize("Äpfel")
Out[3]: ['ä', '##pfel']

The input is lowercased:

  • the model is uncased, but you can't tell this from the tokenizer configuration ;)
  • you find this information when looking at the tokenizer implementation in Transformers:

The ELECTRA tokenizer is based on the BERT tokenizer:

https://github.com/huggingface/transformers/blob/8e13b7359388882d93af5fe312efe56b6556fa23/src/transformers/models/electra/tokenization_electra.py#L52

And the BERT tokenizer comes with this default argument:

https://github.com/huggingface/transformers/blob/8e13b7359388882d93af5fe312efe56b6556fa23/src/transformers/models/bert/tokenization_bert.py#L167

So if you have an ELECTRA model that is cased, you need to specify this in the tokenizer_config.json, as is done for our recently released German ELECTRA models:

https://huggingface.co/deepset/gelectra-large/blob/main/tokenizer_config.json
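
The relevant entry in such a tokenizer_config.json looks roughly like this (a sketch of the key field, not a verbatim copy of the linked file):

{"do_lower_case": false}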

stefan-it (Collaborator) commented:

For example, what is the tokenizer doing for the dbmdz cased model? :)

Good question, it uses the BERT tokenizer implementation (as mentioned above) and also has a tokenizer configuration:

https://huggingface.co/dbmdz/bert-base-german-cased/blob/main/tokenizer_config.json

that disables lowercasing.

stefan-it (Collaborator) commented:

Could you provide a more detailed example of the punctuation problem? 🤔

I tried:

In [6]: tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
In [7]: tokenizer.tokenize("Obst,")
Out[7]: ['obst', ',']

so far.

datistiquo (Author) commented:

I tried:

In [6]: tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
In [7]: tokenizer.tokenize("Obst,")
Out[7]: ['obst', ',']

so far.

So, punctuation marks get their own embeddings? I don't want a single comma to influence my fine-tuning results. Maybe I have to remove such things beforehand then?

stefan-it (Collaborator) commented:

So, punctuation marks get their own embeddings? I don't want a single comma to influence my fine-tuning results. Maybe I have to remove such things beforehand then?

Exactly, so I highly recommend that you perform some kind of preprocessing (that includes punctuation removal, or even stop-word removal).
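
A minimal sketch of such a preprocessing step (strip_punctuation is just a hypothetical helper, not part of any library):

import re

from transformers import AutoTokenizer

def strip_punctuation(text: str) -> str:
    # Remove everything that is neither a word character nor whitespace,
    # so commas etc. never reach the tokenizer.
    return re.sub(r"[^\w\s]", "", text)

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
print(tokenizer.tokenize(strip_punctuation("Obst, Gemüse und Äpfel!")))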

dbmdz locked and limited conversation to collaborators Feb 12, 2021

This issue was moved to a discussion.
