{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":303828715,"defaultBranch":"master","name":"pytorch","ownerLogin":"b-koopman","currentUserCanPush":false,"isFork":true,"isEmpty":false,"createdAt":"2020-10-13T21:05:45.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/24925828?v=4","public":true,"private":false,"isOrgOwned":false},"refInfo":{"name":"","listCacheKey":"v0:1617666414.153402","currentOid":""},"activityList":{"items":[{"before":"dac680721c93ef362f5ff7afc4fa863ec5ebfaab","after":"8a744c31d3c5194b4850869a112d94912c1e08b4","ref":"refs/heads/master","pushedAt":"2023-06-12T03:41:56.166Z","pushType":"push","commitsCount":10000,"pusher":{"login":"b-koopman","name":null,"path":"/b-koopman","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/24925828?s=80&v=4"},"commit":{"message":"Up to 48% speed up using Sklansky method for innermost prefix scan algorithm (#103314)\n\nI found this algorithm (Sklansky) could provide speed-up over the previously implemented Brent-Kung (BK) algorithm. In BK algorithm, the sweeps are done twice: up-sweep and down-sweep. In up-sweep, initially all threads are working, but then half of the working threads becomes inactive in the subsequent step. Similarly for down-sweep but the other way around, where it initially starts with only 1 working thread and double the number of working threads for each sweep. This results of half of the thread is idle on average and produces `2 * log2(num_threads_x)` sweep steps.\n\nOn the other hand, Sklansky algorithm only use 1 sweep and in each step of the sweep, all the threads are working. This algorithm also produces `log2(num_threads_x)` sweep steps which is half of the BK algorithm. That provides the speed up. I follow the schematics of Sklansky algorithm provided in [this paper](https://research.nvidia.com/sites/default/files/pubs/2016-03_Single-pass-Parallel-Prefix/nvr-2016-002.pdf). The same paper provides a much better algorithm (the one implemented in CUB), but I haven't got my head around it, while the Sklansky algorithm is easier to digest and implement.\n\nHere are the speed up from my experiment using `cumsum` in the innermost dimension using A100:\n(UPDATE: the newest commit further optimize it up to 76% on `8 x 4000` matrix)\n(UPDATE: added shapes with 2048 and 1M in its elements)\n| Shape        | Torch cumsum              | Custom cumsum            | Speed up            |\n|--------------|---------------------------|--------------------------|---------------------|\n| (2, 1000)    | 4.8112869262695315e-05   | 2.849102020263672e-05    | 1.688702928870293   |\n| (8, 4000)    | 0.00017731189727783204   | 0.0001005411148071289    | 1.7635760018970834  |\n| (128, 10000) | 0.0005342483520507813    | 0.00035474300384521487   | 1.5060151891928222  |\n| (1024, 20000)| 0.0014238595962524415    | 0.0010990619659423829    | 1.2955225823246128  |\n| (1024, 100000)| 0.007089591026306153    | 0.005468320846557617     | 1.296484099093993   |\n| (2048, 1000000)| 0.058730244636535645 | 0.0458010196685791 | 1.2822912035913994 |\n| (1000, 2)    | 1.0919570922851562e-05   | 8.106231689453125e-06    | 1.3470588235294116  |\n| (4000, 8)    | 9.512901306152343e-06    | 7.867813110351562e-06    | 1.209090909090909   |\n| (10000, 128) | 2.079010009765625e-05    | 1.6164779663085937e-05   | 1.2861356932153394  |\n| (20000, 1024)| 0.00024993419647216796   | 0.00017964839935302734   | 1.3912408759124086  |\n| (100000, 1024)| 0.0011160612106323243   | 0.0009322404861450195    | 1.1971816577581138  |\n| (1000000, 2048) | 0.017030668258666993 | 0.014445066452026367 | 1.178995494082889 |\n\nPull Request resolved: https://github.com/pytorch/pytorch/pull/103314\nApproved by: https://github.com/ngimel","shortMessageHtmlLink":"Up to 48% speed up using Sklansky method for innermost prefix scan al…"}},{"before":"dac680721c93ef362f5ff7afc4fa863ec5ebfaab","after":"8a744c31d3c5194b4850869a112d94912c1e08b4","ref":"refs/heads/master","pushedAt":"2023-06-12T03:41:56.114Z","pushType":"push","commitsCount":10000,"pusher":{"login":"b-koopman","name":null,"path":"/b-koopman","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/24925828?s=80&v=4"},"commit":{"message":"Up to 48% speed up using Sklansky method for innermost prefix scan algorithm (#103314)\n\nI found this algorithm (Sklansky) could provide speed-up over the previously implemented Brent-Kung (BK) algorithm. In BK algorithm, the sweeps are done twice: up-sweep and down-sweep. In up-sweep, initially all threads are working, but then half of the working threads becomes inactive in the subsequent step. Similarly for down-sweep but the other way around, where it initially starts with only 1 working thread and double the number of working threads for each sweep. This results of half of the thread is idle on average and produces `2 * log2(num_threads_x)` sweep steps.\n\nOn the other hand, Sklansky algorithm only use 1 sweep and in each step of the sweep, all the threads are working. This algorithm also produces `log2(num_threads_x)` sweep steps which is half of the BK algorithm. That provides the speed up. I follow the schematics of Sklansky algorithm provided in [this paper](https://research.nvidia.com/sites/default/files/pubs/2016-03_Single-pass-Parallel-Prefix/nvr-2016-002.pdf). The same paper provides a much better algorithm (the one implemented in CUB), but I haven't got my head around it, while the Sklansky algorithm is easier to digest and implement.\n\nHere are the speed up from my experiment using `cumsum` in the innermost dimension using A100:\n(UPDATE: the newest commit further optimize it up to 76% on `8 x 4000` matrix)\n(UPDATE: added shapes with 2048 and 1M in its elements)\n| Shape        | Torch cumsum              | Custom cumsum            | Speed up            |\n|--------------|---------------------------|--------------------------|---------------------|\n| (2, 1000)    | 4.8112869262695315e-05   | 2.849102020263672e-05    | 1.688702928870293   |\n| (8, 4000)    | 0.00017731189727783204   | 0.0001005411148071289    | 1.7635760018970834  |\n| (128, 10000) | 0.0005342483520507813    | 0.00035474300384521487   | 1.5060151891928222  |\n| (1024, 20000)| 0.0014238595962524415    | 0.0010990619659423829    | 1.2955225823246128  |\n| (1024, 100000)| 0.007089591026306153    | 0.005468320846557617     | 1.296484099093993   |\n| (2048, 1000000)| 0.058730244636535645 | 0.0458010196685791 | 1.2822912035913994 |\n| (1000, 2)    | 1.0919570922851562e-05   | 8.106231689453125e-06    | 1.3470588235294116  |\n| (4000, 8)    | 9.512901306152343e-06    | 7.867813110351562e-06    | 1.209090909090909   |\n| (10000, 128) | 2.079010009765625e-05    | 1.6164779663085937e-05   | 1.2861356932153394  |\n| (20000, 1024)| 0.00024993419647216796   | 0.00017964839935302734   | 1.3912408759124086  |\n| (100000, 1024)| 0.0011160612106323243   | 0.0009322404861450195    | 1.1971816577581138  |\n| (1000000, 2048) | 0.017030668258666993 | 0.014445066452026367 | 1.178995494082889 |\n\nPull Request resolved: https://github.com/pytorch/pytorch/pull/103314\nApproved by: https://github.com/ngimel","shortMessageHtmlLink":"Up to 48% speed up using Sklansky method for innermost prefix scan al…"}}],"hasNextPage":false,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAADP0WLXgA","startCursor":null,"endCursor":null}},"title":"Activity · b-koopman/pytorch"}