
Exact training corpus #4

Closed
malteos opened this issue Feb 17, 2022 · 9 comments
malteos commented Feb 17, 2022

Hi @bminixhofer

thanks for sharing your work. Could you provide more details on the training corpus?

In the paper, you write

we restrict the amount of training data to subsets of 4GiB from the OSCAR corpus (Ortiz Suárez et al., 2019)

What exact subset do you use? The unshuffled dedup versions (e.g., unshuffled_deduplicated_de)? Any random n samples with a specific seed? Or just the first/last n rows?

Best,
Malte

@bminixhofer
Member

Hi Malte, sure!

First off, I want to mention: we initially restricted the dataset size to 4 GiB to make sure we didn't need huge amounts of data in the target language, and for practical reasons w.r.t. disk space. The restriction stayed mainly for historical reasons; in the meantime we have dedicated evaluations on low-resource languages, which makes it somewhat obsolete. So if you plan to train a new model, I'd recommend training on the full corpus.

To answer your question: we used the first 4 GiB of e.g. unshuffled_deduplicated_de as training data, and the next 0.4 GiB (4 GiB × 0.1) as validation data. I've uploaded the old script we used to prepare this data under legacy/prepare.py, which should make it possible to reproduce the dataset exactly.
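For readers without the repository at hand, here is a minimal sketch of that split using the Hugging Face `datasets` streaming API. This is not the actual legacy/prepare.py; the exact byte accounting and document separator there may differ, so treat it only as an approximation of the procedure described above.

```python
# Hypothetical sketch: first ~4 GiB of OSCAR German as train, next ~0.4 GiB as validation.
from datasets import load_dataset

TRAIN_BYTES = 4 * 1024**3              # 4 GiB
VALID_BYTES = int(TRAIN_BYTES * 0.1)   # 0.4 GiB

# Stream OSCAR so the whole corpus never has to fit on disk at once.
dataset = load_dataset("oscar", "unshuffled_deduplicated_de", split="train", streaming=True)

def write_split(iterator, path, byte_budget):
    """Write documents to `path` until roughly `byte_budget` UTF-8 bytes have been written."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for example in iterator:
            text = example["text"]
            f.write(text + "\n")
            written += len(text.encode("utf-8")) + 1
            if written >= byte_budget:
                break

it = iter(dataset)
write_split(it, "train.txt", TRAIN_BYTES)   # first ~4 GiB
write_split(it, "valid.txt", VALID_BYTES)   # next ~0.4 GiB
```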

Let me know if you have any more questions!


malteos commented Feb 17, 2022

Thanks! This is what I was looking for.

malteos closed this as completed Feb 17, 2022

malteos commented Mar 14, 2022

I've followed your instructions and created the 4 GB train file, which corresponds to 1,700,699 examples (with a custom GPT-2 tokenizer trained on the same data). Given the batch size of 512, how do you then end up with 250k training steps? Or are you training for multiple epochs?

malteos reopened this Mar 14, 2022

bminixhofer commented Mar 14, 2022

Yes, we're training for many epochs! I just checked; for German it was ~75.
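For reference, the epoch count follows directly from the numbers above, assuming one optimizer step per batch of 512 examples:

```python
examples = 1_700_699      # examples in the 4 GB German train file (count reported above)
batch_size = 512
total_steps = 250_000     # training steps from the paper

steps_per_epoch = examples / batch_size        # ≈ 3322 steps per epoch
epochs = total_steps / steps_per_epoch         # ≈ 75 epochs
print(round(steps_per_epoch), round(epochs))   # 3322 75
```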

Also, I've just made the wandb project public, which should make it easier if you're aiming to reproduce some results. For example, this is the German WECHSEL GPT2 run from the paper: https://wandb.ai/llms-transfer-learning/main/runs/14txxjm8 (I used this project to track everything internally, so it is not very cleaned up).

And as I mentioned previously, I would only recommend using the 4 GB limit if you're aiming to reproduce results from the paper; otherwise I'd go with no limit.


malteos commented Mar 14, 2022

What is the reason for training that many epochs? Did you do any ablation with fewer epochs on a larger dataset?

@bminixhofer
Member

As I mentioned above:

We initially restricted the dataset size to 4 GiB to make sure we didn't need huge amounts of data in the target language, and for practical reasons w.r.t. disk space. The restriction stayed mainly for historical reasons; in the meantime we have dedicated evaluations on low-resource languages, which makes it somewhat obsolete.

To go into a bit more detail: back when we started the experiments, Google's Cloud TPU VMs were restricted to 96 GB of disk space and it was not possible to attach extra disks (thankfully this is now possible!). Including model checkpoints, there wouldn't have been space for much more data. That's of course not ideal, and in hindsight we maybe should've spent some more engineering effort to work around this restriction. However, we don't see any signs of overfitting even when training for that many epochs, and the results from WECHSEL hold when training on more data (I'm currently verifying this by training models for Ukrainian).

I don't know what exactly you're trying to use WECHSEL for. If you're trying to reproduce our results, you can train with the restricted corpus for a large number of epochs as we did in the paper. If you're training a new model for something else, I'd recommend just using as much data as possible.


malteos commented Mar 14, 2022

Thanks for the clarification. And yes, reproducing your work first and then using it for a new model.

malteos closed this as completed Mar 14, 2022
@bminixhofer
Member

Hey, I just came across https://github.com/malteos/german-gpt. Awesome work! I guess you were able to reproduce our results.

I guess https://huggingface.co/malteos/gpt2-xl-wechsel-german is now the largest public German LM. Do you have a Twitter handle, and do you mind if I promote it on Twitter?


malteos commented Jun 24, 2022

Probably yes... or at least I'm not aware of any other model ;)

Feel free to promote it: https://twitter.com/XYOU
