StreamDataset and bptt #29
No, we don't cut anything. Instead, we add padding tokens so that the total number of tokens is a multiple of n_batches * bptt. bptt is basically the sequence length (truncated back-propagation through time; maybe the term is not well chosen here). As for a good value, it depends on what you want to do with it. If you train on a language modeling task for a downstream task where sentences are shorter than 200 words, then 200 is good. In practice we usually use 256 or 512 (be sure it's a multiple of 8 if you use fp16).
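For reference, a minimal sketch of the padding described above (hypothetical code, not XLM's actual implementation; pad_to_multiple and pad_index are illustrative names):

```python
import numpy as np

def pad_to_multiple(tokens, n_batches_times_bptt, pad_index):
    """Append pad_index tokens until the stream length is a multiple of
    n_batches * bptt. Nothing is cut; padding is only added at the end."""
    tokens = np.asarray(tokens)
    n_pad = (-len(tokens)) % n_batches_times_bptt
    return np.concatenate([tokens, np.full(n_pad, pad_index, dtype=tokens.dtype)])

# e.g. a stream of 10 tokens with n_batches * bptt = 8 is padded to length 16
print(len(pad_to_multiple(list(range(10)), 8, pad_index=-1)))  # 16
```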
I have limited the maximum sentence length to 100 before training. Could I set bptt to 128? Does it have a big impact on performance? What are the differences between the StreamDataset and Dataset classes? Thank you very much.
I noticed that params.split_data is False when we train on multiple GPUs. Why do you do it that way?
StreamDataset returns continuous streams of sentences of size (bptt, batch_size), so you can have an arbitrary number of sentences in a batch, while Dataset returns exactly one sentence per sequence. Dataset uses padding to handle sentences of different lengths; StreamDataset does not use any padding. bptt = 128 is fine, but if you have sentences longer than that it won't work well, because the position embeddings for those sentences won't be properly trained. If you set max_len <= 128 then you should be fine.
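To make the contrast concrete, here is a hypothetical sketch (the function names are not from XLM) of the two batching schemes described above:

```python
import numpy as np

def stream_batches(tokens, bptt, batch_size, pad_index):
    """StreamDataset-style: pad the flat token stream up to a multiple of
    batch_size * bptt (as described earlier), reshape it into batch_size
    continuous streams, and yield (bptt, batch_size) slices. There is no
    per-sentence padding; a slice may contain several sentences or the
    middle of one."""
    tokens = np.asarray(tokens)
    n_pad = (-len(tokens)) % (batch_size * bptt)
    tokens = np.concatenate([tokens, np.full(n_pad, pad_index, dtype=tokens.dtype)])
    data = tokens.reshape(batch_size, -1).T            # (n_batches * bptt, batch_size)
    for i in range(0, data.shape[0], bptt):
        yield data[i:i + bptt]                         # (bptt, batch_size)

def sentence_batches(sentences, batch_size, pad_index):
    """Dataset-style: exactly one sentence per sequence, with shorter
    sentences padded to the longest sentence in the batch."""
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        max_len = max(len(s) for s in batch)
        out = np.full((max_len, len(batch)), pad_index, dtype=np.int64)
        for j, s in enumerate(batch):
            out[:len(s), j] = s
        yield out                                      # (max_len, batch_size)
```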
Thank you very much!
Hi, thanks for your work. The following code lines confused me:
XLM/src/data/dataset.py, lines 36-37 in 1bf99af
This seems to mean that complete sentences are cut off. What is the advantage of doing this? And how should the hyper-parameter bptt be set?