
Exact training corpus #4

Closed
malteos opened this issue Feb 17, 2022 · 9 comments
malteos commented Feb 17, 2022

Hi @bminixhofer

thanks for sharing your work. Could you provide more details on the training corpus?

In the paper, you write

we restrict the amount of training data to subsets of 4GiB from the OSCAR corpus (Ortiz Suárez et al., 2019)

What exact subset do you use? The unshuffled dedup versions (e.g., unshuffled_deduplicated_de)? Any random n samples with a specific seed? Or just the first/last n rows?

Best,
Malte

@bminixhofer
Member

Hi Malte, sure!

First off, I want to mention: we initially restricted the dataset size to 4 GiB to make sure we didn't need huge amounts of data in the target language, and for practical reasons w.r.t. disk space. The restriction stayed mainly for historical reasons; in the meantime we have dedicated evaluations on low-resource languages, which makes it somewhat obsolete. So if you plan to train a new model, I'd recommend training on the full corpus.

To answer your question: we used the first 4 GiB of e.g. unshuffled_deduplicated_de as training data, and the next 0.4 GiB (4 GiB × 0.1) as validation data. I've uploaded the old script we used to prepare this data under legacy/prepare.py, which should make it possible to reproduce the dataset exactly.
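For readers without the repository at hand, here is a minimal sketch of that split using the Hugging Face `datasets` streaming API. This is not the actual legacy/prepare.py; the exact byte accounting and document separator there may differ, so treat it only as an approximation of the procedure described above.

```python
# Hypothetical sketch: first ~4 GiB of OSCAR German as train, next ~0.4 GiB as validation.
from datasets import load_dataset

TRAIN_BYTES = 4 * 1024**3              # 4 GiB
VALID_BYTES = int(TRAIN_BYTES * 0.1)   # 0.4 GiB

# Stream OSCAR so the whole corpus never has to fit on disk at once.
dataset = load_dataset("oscar", "unshuffled_deduplicated_de", split="train", streaming=True)

def write_split(iterator, path, byte_budget):
    """Write documents to `path` until roughly `byte_budget` UTF-8 bytes have been written."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for example in iterator:
            text = example["text"]
            f.write(text + "\n")
            written += len(text.encode("utf-8")) + 1
            if written >= byte_budget:
                break

it = iter(dataset)
write_split(it, "train.txt", TRAIN_BYTES)   # first ~4 GiB
write_split(it, "valid.txt", VALID_BYTES)   # next ~0.4 GiB
```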

Let me know if you have any more questions!


malteos commented Feb 17, 2022

Thanks! This is what I was looking for.

malteos closed this as completed Feb 17, 2022

malteos commented Mar 14, 2022

I've followed your instructions and created the 4 GB train file, which corresponds to 1,700,699 examples (with a custom GPT-2 tokenizer trained on the same data). Given the batch size of 512, how do you then end up with 250k training steps? Or are you training for multiple epochs?

malteos reopened this Mar 14, 2022

bminixhofer commented Mar 14, 2022

Yes, we're training for many epochs! I just checked; for German it was ~75.
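For reference, the epoch count follows directly from the numbers above, assuming one optimizer step per batch of 512 examples:

```python
examples = 1_700_699      # examples in the 4 GB German train file (count reported above)
batch_size = 512
total_steps = 250_000     # training steps from the paper

steps_per_epoch = examples / batch_size        # ≈ 3322 steps per epoch
epochs = total_steps / steps_per_epoch         # ≈ 75 epochs
print(round(steps_per_epoch), round(epochs))   # 3322 75
```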

Also, I've just made the wandb project public, which should make it easier if you're aiming to reproduce some results. For example, this is the German WECHSEL GPT2 run from the paper: https://wandb.ai/llms-transfer-learning/main/runs/14txxjm8 (I used this project to track everything internally, so it is not very cleaned up).

And as I mentioned previously, I would only recommend using the 4 GB limit if you're aiming to reproduce results from the paper; otherwise I'd go with no limit.


malteos commented Mar 14, 2022

What is the reason for training that many epochs? Did you do any ablation with fewer epochs on a larger dataset?

@bminixhofer
Member

As I mentioned above:

We initially restricted the dataset size to 4 GiB to make sure we didn't need huge amounts of data in the target language, and for practical reasons w.r.t. disk space. The restriction stayed mainly for historical reasons; in the meantime we have dedicated evaluations on low-resource languages, which makes it somewhat obsolete.

To go into a bit more detail: back when we started the experiments, Google's Cloud TPU VMs were restricted to 96 GB of disk space and it was not possible to attach extra disks (thankfully this is now possible!). Including model checkpoints, there wouldn't have been space for much more data. That's of course not ideal, and in hindsight we maybe should've spent some more engineering effort to work around this restriction. However, we don't see any signs of overfitting even when training for that many epochs, and the results from WECHSEL hold when training on more data (I'm currently verifying this by training models for Ukrainian).

I don't know what exactly you're trying to use WECHSEL for. If you're trying to reproduce our results, you can train with the restricted corpus for a large number of epochs as we did in the paper. If you're training a new model for something else, I'd recommend just using as much data as possible.


malteos commented Mar 14, 2022

Thanks for the clarification. And yes, reproducing your work first and then using it for a new model.

malteos closed this as completed Mar 14, 2022
@bminixhofer
Member

Hey, I just came across https://github.com/malteos/german-gpt. Awesome work! I guess you were able to reproduce our results.

I guess https://huggingface.co/malteos/gpt2-xl-wechsel-german is now the largest public German LM. Do you have a Twitter handle, and do you mind if I promote it on Twitter?


malteos commented Jun 24, 2022

Probably yes... or at least I'm not aware of any other model ;)

Feel free to promote it: https://twitter.com/XYOU
