Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gaussian broadening to SLIT source #214

Merged
merged 1 commit into from
Mar 10, 2024

Conversation

lkeegan
Copy link
Contributor

@lkeegan lkeegan commented Mar 8, 2024

- `param2.x`: width of gaussian broadening in direction perpendicular to both slit and v
- `param2.y`: width of gaussian broadening in direction of slit
@fangq
Copy link
Owner

fangq commented Mar 9, 2024

@lkeegan, thanks for the patch. I think this is a great extension to the slit source and I'd be happy to include it.

I do have a few minor requests

  1. can you also update the documentation on the slit source in mcxlab.m with the new options?
  2. I noticed that you defined 9 new registers in this new code block. by using the make register make command that I just added in the Makefile, I noticed that there is an increase of register counts in 4 out of 16 kernel templates. Here is the register count diff
fangq@taote:~$ diff reg_original.txt reg_newslit.txt 
42c42
< ptxas info    : Used 103 registers, 472 bytes cmem[0], 324 bytes cmem[2]
---
> ptxas info    : Used 106 registers, 472 bytes cmem[0], 324 bytes cmem[2]
50c50
< ptxas info    : Used 105 registers, 472 bytes cmem[0], 388 bytes cmem[2]
---
> ptxas info    : Used 110 registers, 472 bytes cmem[0], 368 bytes cmem[2]
54c54
< ptxas info    : Used 94 registers, 472 bytes cmem[0], 336 bytes cmem[2]
---
> ptxas info    : Used 98 registers, 472 bytes cmem[0], 336 bytes cmem[2]
58c58
< ptxas info    : Used 94 registers, 472 bytes cmem[0], 324 bytes cmem[2]
---
> ptxas info    : Used 92 registers, 472 bytes cmem[0], 324 bytes cmem[2]
62c62
< ptxas info    : Used 128 registers, 472 bytes cmem[0], 332 bytes cmem[2]
---
> ptxas info    : Used 126 registers, 472 bytes cmem[0], 332 bytes cmem[2]
66c66
< ptxas info    : Used 92 registers, 472 bytes cmem[0], 324 bytes cmem[2]
---
> ptxas info    : Used 100 registers, 472 bytes cmem[0], 324 bytes cmem[2]

the increase of register counts ranges from 3-5. It is not a huge deal, although currently MCX's speed is largely capped by the register counts, so any optimization of registers would give a better chance to run more simultaneous blocks.

I understand that the use of new registers is for improving readability, which is also important. I am wondering if it is possible for you to try reusing some of your registers, (of course, add comments to indicate the updated meaning during multiple uses).

If you pull again, and add the register target in your Makefile, you can also use make register to compare the register savings.

let me know if you have any trouble doing this.

thanks again

@fangq
Copy link
Owner

fangq commented Mar 10, 2024

@lkeegan, I managed to cut down the registers from 9 to only 2. let me merge this patch first and add my modifications, please test if the function still work as expected.

@fangq fangq merged commit 56eca8a into fangq:master Mar 10, 2024
27 checks passed
@fangq
Copy link
Owner

fangq commented Mar 10, 2024

here is a diff of register counts compared to pre-merged mcx

fangq@taote:~/space/git/Temp/mcx/src$ diff p1.txt p2.txt 
45c45
< ptxas info    : Used 130 registers, 472 bytes cmem[0], 332 bytes cmem[2]
---
> ptxas info    : Used 129 registers, 472 bytes cmem[0], 332 bytes cmem[2]
49c49
< ptxas info    : Used 105 registers, 472 bytes cmem[0], 388 bytes cmem[2]
---
> ptxas info    : Used 106 registers, 472 bytes cmem[0], 368 bytes cmem[2]
53c53
< ptxas info    : Used 94 registers, 472 bytes cmem[0], 336 bytes cmem[2]
---
> ptxas info    : Used 98 registers, 472 bytes cmem[0], 336 bytes cmem[2]
61c61
< ptxas info    : Used 128 registers, 472 bytes cmem[0], 332 bytes cmem[2]
---
> ptxas info    : Used 122 registers, 472 bytes cmem[0], 332 bytes cmem[2]
65c65
< ptxas info    : Used 92 registers, 472 bytes cmem[0], 324 bytes cmem[2]
---
> ptxas info    : Used 100 registers, 472 bytes cmem[0], 324 bytes cmem[2]

3 templates still see growth of registers (1, 4, 8), compensated by drop of registers in 2 other templates (-1 and -6). tested on CUDA 10.2. other cuda versions may see different numbers.

@@ -1462,6 +1462,29 @@ __device__ inline int launchnewphoton(MCXpos* p, MCXdir* v, Stokes* s, MCXtime*
rotatevector(v, 1.f, 0.f, sphi, cphi);
}

if (MCX_SRC_SLIT && (launchsrc->param2.x > 0.f || launchsrc->param2.y > 0.f)) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the first condition should be gcfg->srctype == MCX_SRC_SLIT. I will post my fix, just leave here for a note.

fangq added a commit that referenced this pull request Mar 10, 2024
@lkeegan
Copy link
Contributor Author

lkeegan commented Mar 11, 2024

@fangq many thanks for the quick review and the performance improvements & fixes to the PR!

I tested the master branch and it all works great, thanks!

If you are interested I also tried make register locally with cuda 11.8 and 12.4, and for these versions there was no change in the number of registers used between any of the variants of the code.

Also, if reducing the number of registers used is key for good performance then it looks like separating the parallel and perpendicular contributions to v allows more register re-use (using cuda 10.2, sm_35):

@fangq
Copy link
Owner

fangq commented Mar 12, 2024

thanks, I wasn't aware of this compiler explorer feature, thanks for sharing it!

yes, if you run nsight-compute to profile mcx kernels, you can see that the current throughput is largely limited by register counts (similarly for mcxcl and mmc). mcx has minimal memory overhead and thread divergence.

from the godbolt page, it seems the output was generated from -Xptxas -v, which is the same way as make register. I am afraid implementing just this function alone won't report the same register use as having it as part of the larger mcx kernel, because the number of registers that can be reused or optimized is dependent on the context. I also use -O3 in the compilation which could also be a bit aggressive. if you see other key factors that can reduce registers in the larger mcx kernels, I would love to hear about it.

on the flip side, I would not go to the other extreme of destroy readability to trade small gains in registers either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Gaussian divergence for slit source
2 participants