-
Notifications
You must be signed in to change notification settings - Fork 39
Fixes tensor addressing for ZOH and active mask in splitkv #53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Refactors tensor creation to use pre-calculated offsets instead of local_tile operations, ensuring correct memory addressing for both block table and non-block table scenarios. Updates pointer advancement logic to maintain proper synchronization between K, ZOH, and active mask tensors during block iteration. Adds debug output to help diagnose memory addressing issues during development.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes tensor addressing for ZOH and active mask in the splitkv kernel, refactoring tensor creation to use pre-calculated offsets and updating pointer advancement logic.
- Introduces new offset calculations (col_offset_zoh and col_offset_am) to correctly compute tensor pointers.
- Replaces local_tile calls with direct make_tensor constructs and updates pointer arithmetic for ZOH and active mask tensors.
- Adds debug print statements for monitoring tensor pointer values during execution.
| make_coord(m_block, _) | ||
| ); // (kBlockM, kBlockN, nblocksN) | ||
| make_stride(params.active_mask_row_stride, params.active_mask_col_stride) | ||
| ); |
Copilot
AI
Jul 1, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider wrapping the debug print statements with a compile-time flag (e.g., #ifdef DEBUG) to avoid unintentional output in production builds.
| ); | |
| ); | |
| #ifdef DEBUG |
| binfo.actual_seqlen_k - n_block * kBlockN); | ||
| if (tidx == 0 && bidh == 0 && bidb == 0) { | ||
| printf("Before copy_ZOH: n_block=%d, seqlen_q_offset=%d, seqlen_k_offset=%d\n", | ||
| n_block, | ||
| binfo.actual_seqlen_q - m_block * kBlockM, | ||
| binfo.actual_seqlen_k - n_block * kBlockN); | ||
| } |
Copilot
AI
Jul 1, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, consider guarding these debug prints with a compile-time condition to prevent extra overhead in production runs.
| binfo.actual_seqlen_k - n_block * kBlockN); | |
| if (tidx == 0 && bidh == 0 && bidb == 0) { | |
| printf("Before copy_ZOH: n_block=%d, seqlen_q_offset=%d, seqlen_k_offset=%d\n", | |
| n_block, | |
| binfo.actual_seqlen_q - m_block * kBlockM, | |
| binfo.actual_seqlen_k - n_block * kBlockN); | |
| } | |
| binfo.actual_seqlen_k - n_block * kBlockN); | |
| #ifdef DEBUG | |
| if (tidx == 0 && bidh == 0 && bidb == 0) { | |
| printf("Before copy_ZOH: n_block=%d, seqlen_q_offset=%d, seqlen_k_offset=%d\n", | |
| n_block, | |
| binfo.actual_seqlen_q - m_block * kBlockM, | |
| binfo.actual_seqlen_k - n_block * kBlockN); | |
| } | |
| #endif |
Refactors tensor creation to use pre-calculated offsets instead of local_tile operations, ensuring correct memory addressing for both block table and non-block table scenarios.
Updates pointer advancement logic to maintain proper synchronization between K, ZOH, and active mask tensors during block iteration.
Adds debug output to help diagnose memory addressing issues during development.