
StreamDataset and bptt #29

Closed
liujiqiang999 opened this issue Mar 4, 2019 · 5 comments

Comments

@liujiqiang999

Hi, thanks for your work. The following code line confused me:

buffer[t_size - n_tokens:] = sent

buffer = buffer.reshape((bs, n_batches * bptt)).T

It looks like this cuts off complete sentences. What is the advantage of doing this? And how should the hyper-parameter bptt be set?

@glample
Contributor

glample commented Mar 4, 2019

No, we don't cut anything. Instead, we add padding tokens to have a number of tokens which is a multiple of n_batches * bptt.

bptt is basically the sequence length (truncated back-propagation through time; maybe the term is not well chosen here). The right value depends on what you want to do with it. If you train on a language modeling task for a downstream task where sentences are shorter than 200 words, then 200 is good. In practice we usually use 256 or 512 (be sure it's a multiple of 8 if you use fp16).
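
For reference, a minimal NumPy sketch of the padding and reshaping described above; the variable names and sizes are illustrative, not the repo's exact code:

import math
import numpy as np

bs, bptt = 4, 8                      # hypothetical batch size and sequence length
eos_index = 1                        # assumed eos/padding token id
sent = np.arange(2, 103)             # fake token stream of 101 tokens

n_tokens = len(sent)
n_batches = math.ceil(n_tokens / (bs * bptt))    # ceil(101 / 32) = 4
t_size = n_batches * bptt * bs                   # 4 * 8 * 4 = 128 slots

buffer = np.full(t_size, eos_index, dtype=sent.dtype)
buffer[t_size - n_tokens:] = sent                # front-pad with eos; nothing is cut
data = buffer.reshape((bs, n_batches * bptt)).T  # shape (n_batches * bptt, bs)

# each training step then consumes one (bptt, bs) slice: data[i * bptt:(i + 1) * bptt]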

@liujiqiang999
Author

liujiqiang999 commented Mar 5, 2019

I have limited the maximum length of sentences to 100 before training. Could I set bptt to 128? Does it have a big impact on performance? Also, what are the differences between the StreamDataset and Dataset classes? Thank you very much.

@liujiqiang999
Author

I noticed that params.split_data is False when we use multiple GPUs for training. Why is it done that way?

@glample
Contributor

glample commented Mar 5, 2019

StreamDataset returns continuous streams of sentences of size (bptt, batch_size), so you can have an arbitrary number of sentences in a batch, while Dataset returns one and only one sentence per sequence. Dataset uses padding to align sentences of different lengths; StreamDataset does not use any padding. bptt = 128 is fine, but if you have sentences longer than that it won't work well, because the position embeddings for those sentences won't be properly trained. If you set max_len <= 128 then you should be fine.

params.split_data = True is good when your dataset is so big that you cannot load it 8 times in memory.
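
To make the shape difference concrete, here is a rough NumPy sketch of the two batch layouts described above (names, token ids, and values are illustrative, not the library's actual batching code):

import numpy as np

bptt, bs = 8, 4
pad_index, eos_index = 2, 1          # assumed special token ids

# StreamDataset-style batch: a dense (bptt, bs) slice of the continuous stream.
# There is no padding inside a batch; sentence boundaries are simply eos tokens
# that happen to fall somewhere in the slice.
stream = np.random.randint(5, 100, size=(3 * bptt, bs))
stream_batch = stream[:bptt]                         # shape (bptt, bs)

# Dataset-style batch: exactly one sentence per column, shorter sentences
# padded with pad_index up to the longest sentence in the batch.
sentences = [np.array([5, 6, 7, eos_index]), np.array([8, 9, eos_index])]
max_len = max(len(s) for s in sentences)
dataset_batch = np.full((max_len, len(sentences)), pad_index, dtype=np.int64)
for j, s in enumerate(sentences):
    dataset_batch[:len(s), j] = s                    # shape (max_len, n_sentences)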

@liujiqiang999
Copy link
Author

Thank you very much!

@glample glample closed this as completed Mar 8, 2019