StreamDataset and bptt #29
No, we don't cut anything. Instead, we add padding tokens so that the total number of tokens is a multiple of n_batches * bptt. bptt is basically the sequence length (truncated back-propagation through time; maybe the term is not well chosen here). As for a good value, it depends on what you want to do with it. If you train on a language modeling task for a downstream task where sentences are shorter than 200 words, then 200 is good. In practice we usually use 256 or 512 (be sure it's a multiple of 8 if you use fp16).
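For reference, a minimal sketch of the padding described above (hypothetical code, not XLM's actual implementation; pad_to_multiple and pad_index are illustrative names):

```python
import numpy as np

def pad_to_multiple(tokens, n_batches_times_bptt, pad_index):
    """Append pad_index tokens until the stream length is a multiple of
    n_batches * bptt. Nothing is cut; padding is only added at the end."""
    tokens = np.asarray(tokens)
    n_pad = (-len(tokens)) % n_batches_times_bptt
    return np.concatenate([tokens, np.full(n_pad, pad_index, dtype=tokens.dtype)])

# e.g. a stream of 10 tokens with n_batches * bptt = 8 is padded to length 16
print(len(pad_to_multiple(list(range(10)), 8, pad_index=-1)))  # 16
```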
I have limited the maximum sentence length to 100 before training. Could I set bptt to 128? Does it have a big impact on performance? What are the differences between the StreamDataset and Dataset classes? Thank you very much.
I noticed that params.split_data is False when we train on multiple GPUs. Why do you do it that way?
StreamDataset returns continuous streams of sentences of size (bptt, batch_size), so you can have an arbitrary number of sentences in a batch, while Dataset returns exactly one sentence per sequence. Dataset uses padding to handle sentences of different lengths; StreamDataset does not use any padding. bptt = 128 is fine, but if you have sentences longer than that it won't work well, because the position embeddings for those sentences won't be properly trained. If you set max_len <= 128 then you should be fine.
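To make the contrast concrete, here is a hypothetical sketch (the function names are not from XLM) of the two batching schemes described above:

```python
import numpy as np

def stream_batches(tokens, bptt, batch_size, pad_index):
    """StreamDataset-style: pad the flat token stream up to a multiple of
    batch_size * bptt (as described earlier), reshape it into batch_size
    continuous streams, and yield (bptt, batch_size) slices. There is no
    per-sentence padding; a slice may contain several sentences or the
    middle of one."""
    tokens = np.asarray(tokens)
    n_pad = (-len(tokens)) % (batch_size * bptt)
    tokens = np.concatenate([tokens, np.full(n_pad, pad_index, dtype=tokens.dtype)])
    data = tokens.reshape(batch_size, -1).T            # (n_batches * bptt, batch_size)
    for i in range(0, data.shape[0], bptt):
        yield data[i:i + bptt]                         # (bptt, batch_size)

def sentence_batches(sentences, batch_size, pad_index):
    """Dataset-style: exactly one sentence per sequence, with shorter
    sentences padded to the longest sentence in the batch."""
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        max_len = max(len(s) for s in batch)
        out = np.full((max_len, len(batch)), pad_index, dtype=np.int64)
        for j, s in enumerate(batch):
            out[:len(s), j] = s
        yield out                                      # (max_len, batch_size)
```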
Thank you very much!
Hi, thanks for your work. The following code lines confused me:
XLM/src/data/dataset.py, lines 36-37 in 1bf99af
This seems to mean that complete sentences are cut off. What is the advantage of doing this? And how should the hyper-parameter bptt be set?