{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":731917142,"defaultBranch":"main","name":"llvm-project","ownerLogin":"dustanddreams","currentUserCanPush":false,"isFork":true,"isEmpty":false,"createdAt":"2023-12-15T07:28:52.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/118974489?v=4","public":true,"private":false,"isOrgOwned":false},"refInfo":{"name":"","listCacheKey":"v0:1718895857.0","currentOid":""},"activityList":{"items":[{"before":"67226bad150785f64efcf53c79b7785d421fc8eb","after":"0255c48188801b20884bb6b2603d3af642782fba","ref":"refs/heads/main","pushedAt":"2024-06-21T07:24:29.000Z","pushType":"push","commitsCount":83,"pusher":{"login":"dustanddreams","name":"Miod Vallat","path":"/dustanddreams","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/118974489?s=80&v=4"},"commit":{"message":"[mlir][Transforms] Dialect conversion: Remove workaround (#96186)\n\nThis commit removes a `FIXME` in the code base that was in place because\r\nof patterns that used the dialect conversion API incorrectly. Those\r\npatterns have been fixed and the workaround is no longer needed.","shortMessageHtmlLink":"[mlir][Transforms] Dialect conversion: Remove workaround (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2364373348\" data-permission-text=\"Title is private\" data-url=\"https://github.com/llvm/llvm-project/issues/96186\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/llvm/llvm-project/pull/96186/hovercard\" href=\"https://github.com/llvm/llvm-project/pull/96186\">llvm#96186</a>)"}},{"before":"633884748c82acf4026ad773a08fc44fba9ef3f9","after":null,"ref":"refs/heads/glibc_fstat_interceptor_fix","pushedAt":"2024-06-20T15:04:17.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"dustanddreams","name":"Miod Vallat","path":"/dustanddreams","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/118974489?s=80&v=4"}},{"before":"8f7fdd94ef19af7b4905b316c253a78219a6038f","after":"67226bad150785f64efcf53c79b7785d421fc8eb","ref":"refs/heads/main","pushedAt":"2024-06-20T14:59:11.000Z","pushType":"push","commitsCount":10000,"pusher":{"login":"dustanddreams","name":"Miod Vallat","path":"/dustanddreams","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/118974489?s=80&v=4"},"commit":{"message":"[Support] Vendor rpmalloc in-tree and use it for the Windows 64-bit release (#91862)\n\n### Context\r\n\r\nWe have a longstanding performance issue on Windows, where to this day,\r\nthe default heap allocator is still lockfull. With the number of cores\r\nincreasing, building and using LLVM with the default Windows heap\r\nallocator is sub-optimal. Notably, the ThinLTO link times with LLD are\r\nextremely long, and increase proportionally with the number of cores in\r\nthe machine.\r\n\r\nIn\r\nhttps://github.com/llvm/llvm-project/commit/a6a37a2fcd2a8048a75bd0d8280497ed89d73224,\r\nI introduced the ability build LLVM with several popular lock-free\r\nallocators. Downstream users however have to build their own toolchain\r\nwith this option, and building an optimal toolchain is a bit tedious and\r\nlong. Additionally, LLVM is now integrated into Visual Studio, which\r\nAFAIK re-distributes the vanilla LLVM binaries/installer. The point\r\nbeing that many users are impacted and might not be aware of this\r\nproblem, or are unable to build a more optimal version of the toolchain.\r\n\r\nThe symptom before this PR is that most of the CPU time goes to the\r\nkernel (darker blue) when linking with ThinLTO:\r\n\r\n\r\n![16c_ryzen9_windows_heap](https://github.com/llvm/llvm-project/assets/37383324/86c3f6b9-6028-4c1a-ba60-a2fa3876fba7)\r\n\r\nWith this PR, most time is spent in user space (light blue):\r\n\r\n\r\n![16c_ryzen9_rpmalloc](https://github.com/llvm/llvm-project/assets/37383324/646b88f3-5b6d-485d-a2e4-15b520bdaf5b)\r\n\r\nOn higher core count machines, before this PR, the CPU usage becomes\r\npretty much flat because of contention:\r\n\r\n<img width=\"549\" alt=\"VM_176_windows_heap\"\r\nsrc=\"https://github.com/llvm/llvm-project/assets/37383324/f27d5800-ee02-496d-a4e7-88177e0727f0\">\r\n\r\n\r\nWith this PR, similarily most CPU time is now used:\r\n\r\n<img width=\"549\" alt=\"VM_176_with_rpmalloc\"\r\nsrc=\"https://github.com/llvm/llvm-project/assets/37383324/7d4785dd-94a7-4f06-9b16-aaa4e2e505c8\">\r\n\r\n### Changes in this PR\r\n\r\nThe avenue I've taken here is to vendor/re-licence rpmalloc in-tree, and\r\nuse it when building the Windows 64-bit release. Given the permissive\r\nrpmalloc licence, prior discussions with the LLVM foundation and\r\n@lattner suggested this vendoring. Rpmalloc's author (@mjansson) kindly\r\nagreed to ~~donate~~ re-licence the rpmalloc code in LLVM (please do\r\ncorrect me if I misinterpreted our past communications).\r\n\r\nI've chosen rpmalloc because it's small and gives the best value\r\noverall. The source code is only 4 .c files. Rpmalloc is statically\r\nreplacing the weak CRT alloc symbols at link time, and has no dynamic\r\npatching like mimalloc. As an alternative, there were several\r\nunsuccessfull attempts made by Russell Gallop to use SCUDO in the past,\r\nplease see thread in https://reviews.llvm.org/D86694. If later someone\r\ncomes up with a PR of similar performance that uses SCUDO, we could then\r\ndelete this vendored rpmalloc folder.\r\n\r\nI've added a new cmake flag `LLVM_ENABLE_RPMALLOC` which essentialy sets\r\n`LLVM_INTEGRATED_CRT_ALLOC` to the in-tree rpmalloc source.\r\n\r\n### Performance\r\n\r\nThe most obvious test is profling a ThinLTO linking step with LLD. I've\r\nused a Clang compilation as a testbed, ie.\r\n```\r\nset OPTS=/GS- /D_ITERATOR_DEBUG_LEVEL=0 -Xclang -O3 -fstrict-aliasing -march=native -flto=thin -fwhole-program-vtables -fuse-ld=lld\r\ncmake -G Ninja %ROOT%/llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=TRUE -DLLVM_ENABLE_PROJECTS=\"clang\" -DLLVM_ENABLE_PDB=ON -DLLVM_OPTIMIZED_TABLEGEN=ON -DCMAKE_C_COMPILER=clang-cl.exe -DCMAKE_CXX_COMPILER=clang-cl.exe -DCMAKE_LINKER=lld-link.exe -DLLVM_ENABLE_LLD=ON -DCMAKE_CXX_FLAGS=\"%OPTS%\" -DCMAKE_C_FLAGS=\"%OPTS%\" -DLLVM_ENABLE_LTO=THIN\r\n```\r\nI've profiled the linking step with no LTO cache, with Powershell, such\r\nas:\r\n```\r\nMeasure-Command { lld-link /nologo @CMakeFiles\\clang.rsp /out:bin\\clang.exe /implib:lib\\clang.lib /pdb:bin\\clang.pdb /version:0.0 /machine:x64 /STACK:10000000 /DEBUG /OPT:REF /OPT:ICF /INCREMENTAL:NO /subsystem:console /MANIFEST:EMBED,ID=1 }`\r\n```\r\n\r\nTimings:\r\n\r\n| Machine | Allocator | Time to link |\r\n|--------|--------|--------|\r\n| 16c/32t AMD Ryzen 9 5950X | Windows Heap | 10 min 38 sec |\r\n|  | **Rpmalloc** | **4 min 11 sec** |\r\n| 32c/64t AMD Ryzen Threadripper PRO 3975WX | Windows Heap | 23 min 29\r\nsec |\r\n|  | **Rpmalloc** | **2 min 11 sec** |\r\n|  | **Rpmalloc + /threads:64** | **1 min 50 sec** |\r\n| 176 vCPU (2 socket) Intel Xeon Platinium 8481C (fixed clock 2.7 GHz) |\r\nWindows Heap | 43 min 40 sec |\r\n|  | **Rpmalloc** | **1 min 45 sec** |\r\n\r\nThis also improves the overall performance when building with clang-cl.\r\nI've profiled a regular compilation of clang itself, ie:\r\n```\r\nset OPTS=/GS- /D_ITERATOR_DEBUG_LEVEL=0 /arch:AVX -fuse-ld=lld\r\ncmake -G Ninja %ROOT%/llvm -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=TRUE -DLLVM_ENABLE_PROJECTS=\"clang;lld\" -DLLVM_ENABLE_PDB=ON -DLLVM_OPTIMIZED_TABLEGEN=ON -DCMAKE_C_COMPILER=clang-cl.exe -DCMAKE_CXX_COMPILER=clang-cl.exe -DCMAKE_LINKER=lld-link.exe -DLLVM_ENABLE_LLD=ON -DCMAKE_CXX_FLAGS=\"%OPTS%\" -DCMAKE_C_FLAGS=\"%OPTS%\"\r\n```\r\nThis saves approx. 30 sec when building on the Threadripper PRO 3975WX:\r\n```\r\n(default Windows Heap)\r\nC:\\src\\git\\llvm-project>hyperfine -r 5 -p \"make_llvm.bat stage1_test2\" \"ninja clang -C stage1_test2\"\r\nBenchmark 1: ninja clang -C stage1_test2\r\n  Time (mean ± σ):     392.716 s ±  3.830 s    [User: 17734.025 s, System: 1078.674 s]\r\n  Range (min … max):   390.127 s … 399.449 s    5 runs\r\n\r\n(rpmalloc)\r\nC:\\src\\git\\llvm-project>hyperfine -r 5 -p \"make_llvm.bat stage1_test2\" \"ninja clang -C stage1_test2\"\r\nBenchmark 1: ninja clang -C stage1_test2\r\n  Time (mean ± σ):     360.824 s ±  1.162 s    [User: 15148.637 s, System: 905.175 s]\r\n  Range (min … max):   359.208 s … 362.288 s    5 runs\r\n```","shortMessageHtmlLink":"[Support] Vendor rpmalloc in-tree and use it for the Windows 64-bit r…"}},{"before":"35c19fdde2583e74d940f6cd47b97a5c28bfe368","after":"8f7fdd94ef19af7b4905b316c253a78219a6038f","ref":"refs/heads/main","pushedAt":"2024-01-17T11:49:38.000Z","pushType":"push","commitsCount":466,"pusher":{"login":"dustanddreams","name":"Miod Vallat","path":"/dustanddreams","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/118974489?s=80&v=4"},"commit":{"message":"[clang][AST] Invalidate DecompositionDecl if it has invalid initializer. (#72428)\n\nFix #67495, #72198\r\n\r\nWe build ill-formed AST nodes for invalid structured binding. For case\r\n`int [_, b] = {0, 0};`, the `DecompositionDecl` is valid, and its\r\nchildren `BindingDecl`s are valid but with a NULL type, this breaks\r\nclang invariants in many places, and using these `BindingDecl`s can lead\r\nto crashes. This patch fixes them by marking the DecompositionDecl and\r\nits children invalid.","shortMessageHtmlLink":"[clang][AST] Invalidate DecompositionDecl if it has invalid initializ…"}},{"before":"c532ba4edd7ad7675ba450ba43268aa9e7bda46b","after":"35c19fdde2583e74d940f6cd47b97a5c28bfe368","ref":"refs/heads/main","pushedAt":"2024-01-12T11:08:48.000Z","pushType":"push","commitsCount":2032,"pusher":{"login":"dustanddreams","name":"Miod Vallat","path":"/dustanddreams","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/118974489?s=80&v=4"},"commit":{"message":"[mlir][vector] Support warp distribution of `transfer_read` with dependencies (#77779)\n\nSupport distribution of `vector.transfer_read` ops when operands are\r\ndefined inside of the region of `warp_execute_on_lane_0` (except for the\r\nbuffer from which the op is reading).\r\n\r\nSuch IR was previously not supported. This commit changes the\r\nimplementation such that indices and the padding value are also\r\ndistributed.\r\n\r\nThis commit simplifies the implementation considerably: the original\r\nimplementation created a new `transfer_read` op and then checked if this\r\nnew op is valid. If not, the rewrite pattern failed. This was a bit\r\nhacky. It was also a violation of the rewrite pattern API (detected by\r\n`MLIR_ENABLE_EXPENSIVE_PATTERN_API_CHECKS`) because the IR was modified,\r\nbut the pattern returned \"failure\".","shortMessageHtmlLink":"[mlir][vector] Support warp distribution of <code>transfer_read</code> with depe…"}},{"before":null,"after":"633884748c82acf4026ad773a08fc44fba9ef3f9","ref":"refs/heads/glibc_fstat_interceptor_fix","pushedAt":"2023-12-15T07:43:42.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"dustanddreams","name":"Miod Vallat","path":"/dustanddreams","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/118974489?s=80&v=4"},"commit":{"message":"Make the fstat{,64} interceptors wrap fstat{,64} on glibc.\n\nInvoking the underlying __fxstat64 routine is not portable, for the\nstat struct version argument is platform-dependent and not exposed in\nthe userland headers.\n\nFixes #75346.","shortMessageHtmlLink":"Make the fstat{,64} interceptors wrap fstat{,64} on glibc."}}],"hasNextPage":false,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAAEa1BSMgA","startCursor":null,"endCursor":null}},"title":"Activity · dustanddreams/llvm-project"}