fix pos embedding index bug #144

Merged: 6 commits into bytedance:master on Aug 4, 2021

Conversation

@nomadlx (Contributor) commented on Aug 2, 2021

Fixed the implementation of the position embedding:

  • the size of the position matrix is determined by the max_positions parameter
  • all padding tokens are ignored when calculating token positions
  • the position index begins from padding_idx + 1, consistent with the fairseq implementation (see the sketch below)
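For concreteness, here is a minimal sketch of the fairseq-style position computation that these changes follow (a simplified PyTorch version for illustration, not the actual LightSeq kernel):

```python
import torch

def make_positions_fairseq_style(tokens: torch.Tensor, padding_idx: int) -> torch.Tensor:
    """Compute position indices the fairseq way: padding tokens keep index
    padding_idx, and the first non-padding token gets padding_idx + 1."""
    mask = tokens.ne(padding_idx).int()
    # cumsum counts non-padding tokens from the left; multiplying by the mask
    # zeroes out padding positions, and adding padding_idx shifts real tokens
    # so that they start at padding_idx + 1.
    return (torch.cumsum(mask, dim=1) * mask).long() + padding_idx

# Example with left padding and padding_idx = 1:
# tokens    = [[1, 1, 5, 6, 7]]
# positions = [[1, 1, 2, 3, 4]]
```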

@Taka152 (Contributor) commented on Aug 3, 2021

Thanks for your contribution. You are the first contributor from the community 😆. We'll review it as soon as possible.

@godweiyang (Collaborator) commented on Aug 3, 2021

Thanks for your meaningful contribution!

As I understand it, your pull request makes 4 improvements:

  1. Let pos_emb[pad_id] = 0, matching the latest version of Fairseq.
  2. Change the pos_id of the first non-padding token to pad_id + 1.
  3. Change MAX_POSITION from 300 to max_position + pad_id + 1 = 1027.
  4. Fix the left padding position index bug.

What I want to say is that LightSeq is designed to be general and not tied to any particular training framework (e.g., Fairseq), so the first 3 points, which only make LightSeq consistent with Fairseq, do not seem very necessary to me. In actual training, I have found that offsetting positions by a few indexes does not affect the results. If you really want to be completely consistent with Fairseq, you should instead modify the modules under examples/training/fairseq/fs_modules so that Fairseq adapts to LightSeq.

About the third point, if the default max_position = 1024 from Fairseq is used, LightSeq will pre-allocate much more GPU memory and users may encounter OOM errors. So we specify a smaller value (300) here, which meets the needs of most NLP datasets.

The last point is indeed a bug in LightSeq; we did not consider left padding in the previous version.

So could you consider merging only the last point, using the general position embedding (i.e., ignore the position embedding of padding tokens, and make the position index of the first non-padding token equal to 0)?
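For illustration, a minimal sketch of that general scheme (plain PyTorch with illustrative names, not the actual LightSeq kernel code):

```python
import torch

def make_positions_general(tokens: torch.Tensor, pad_id: int) -> torch.Tensor:
    """General position indices: padding is ignored and the first
    non-padding token gets position 0, regardless of framework conventions."""
    mask = tokens.ne(pad_id)
    # Count the non-padding tokens seen so far, then shift so the first
    # real token is position 0; padding positions are clamped back to 0.
    positions = torch.cumsum(mask.int(), dim=1) - 1
    return (positions.clamp(min=0) * mask).long()

# Example with left padding and pad_id = 1:
# tokens    = [[1, 1, 5, 6, 7]]
# positions = [[0, 0, 0, 1, 2]]
```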

Or if you have a better idea, you can discuss it with us in the FeiShu user open group.

Thank you again for your contribution; we look forward to your reply!

@nomadlx (Contributor, Author) commented on Aug 3, 2021

Thank you for your review suggestions.

Regarding the first two points, they are indeed not necessary and I will undo those modifications.

About the third point, I think the parameters specified by the user should take higher priority, but that involves changing the default parameters of the translation task. Maybe it would be more appropriate to set the corresponding parameter to 300, or to min(MAX_SEQ_LENGTH, args.max_source_positions), in the model architecture definition (e.g., base_architecture) and keep the modification here? I think that setting would also be easier for users to discover.
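Roughly something like the following sketch (the hook point and names such as MAX_SEQ_LENGTH are illustrative, not the final implementation):

```python
# Hypothetical sketch: cap the user-specified positions in the architecture
# defaults instead of hard-coding MAX_POSITION inside the embedding layer.
MAX_SEQ_LENGTH = 300  # LightSeq's pre-allocation limit, as discussed above

def base_architecture(args):
    # Respect --max-source-positions / --max-target-positions when given,
    # but never pre-allocate more than MAX_SEQ_LENGTH positions.
    args.max_source_positions = min(
        MAX_SEQ_LENGTH, getattr(args, "max_source_positions", MAX_SEQ_LENGTH)
    )
    args.max_target_positions = min(
        MAX_SEQ_LENGTH, getattr(args, "max_target_positions", MAX_SEQ_LENGTH)
    )
```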

The fourth point will stay as it is.

@godweiyang (Collaborator) commented

Yes, I think it makes sense. Looking forward to your changes to the last two points~

@godweiyang (Collaborator) commented

Can you try the unit test of the embedding layer by running python3 tests/test_ls_ops.py? I encountered the error "RuntimeError: CUDA error: invalid argument".

@godweiyang (Collaborator) commented

Another small bug: if seq_len > 1024, won't the block_dim in Line 231 be greater than 1024?

@nomadlx (Contributor, Author) commented on Aug 3, 2021

It looks like the dynamic shared memory is over the limit. I shouldn't be using shared memory to store such a large matrix. I'll fix this later.
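As rough arithmetic for why such a matrix cannot live in shared memory (the embedding dimension here is an assumption; 48 KB is the typical per-block shared memory budget on most GPUs):

```python
# Back-of-the-envelope check, with assumed sizes for illustration.
max_positions = 1027      # max_position + pad_id + 1, as discussed above
embedding_dim = 512       # assumed transformer-base embedding size
bytes_per_float = 4

needed_bytes = max_positions * embedding_dim * bytes_per_float  # ~2.1 MB
shared_mem_limit = 48 * 1024                                    # ~48 KB per block
print(needed_bytes, shared_mem_limit, needed_bytes > shared_mem_limit)  # far over the limit
```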

@nomadlx (Contributor, Author) commented on Aug 3, 2021

Regarding block_dim in Line 231: it does not seem to exceed 1024, because I took the minimum of the two. Or am I misunderstanding what you mean?

@godweiyang (Collaborator) commented

For example, with seq_len = 2048, block_x_tokens = 1024 and block_y_tokens = 2, so block_dim = 2048 > 1024.
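To make the constraint concrete, a small sketch of the arithmetic in this example (CUDA caps the total threads per block at 1024; block_x_tokens/block_y_tokens mirror the names above, and the real kernel may compute them differently):

```python
MAX_THREADS_PER_BLOCK = 1024  # CUDA limit on block.x * block.y * block.z

def block_dims(seq_len: int):
    # Mirrors the example above: split seq_len into x/y block dimensions.
    block_x_tokens = min(seq_len, MAX_THREADS_PER_BLOCK)
    block_y_tokens = (seq_len + block_x_tokens - 1) // block_x_tokens  # ceil division
    return block_x_tokens, block_y_tokens

bx, by = block_dims(2048)
print(bx, by, bx * by)  # 1024 2 2048 -> exceeds the 1024-thread limit, hence the bug
```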

@nomadlx (Contributor, Author) commented on Aug 3, 2021

I get it. I'll fix it.

@nomadlx (Contributor, Author) commented on Aug 4, 2021

I fixed the two bugs mentioned above, and the related tests now pass. Sorry, I overlooked the unit tests before.

@godweiyang (Collaborator) commented

Thanks a lot! The unit test also passes on my side, using both left and right padding with seq_len <= 2048. We will review and merge your code soon~

@godweiyang merged commit 913b999 into bytedance:master on Aug 4, 2021
@nomadlx (Contributor, Author) commented on Aug 4, 2021

Thank you for formatting the code and adding the corresponding unit tests. This is my first contribution, and I overlooked many details.
