Skip to content

Conversation

@walter-erquinigo
Copy link
Collaborator

As preliminary context, everything belows applies to the client. The server side is all fine.

For NVIDIA, after connecting to the GPU process we issue a resume_gpu GPUAction. We were doing that after connecting to the process asynchronously, which resulted in a race condition in which the following actions were happening at the same time:

  • the gpu process was handling its stop state, fetching the state of its threads
  • the GPUAction was issuing a Resume request

this caused that very often, after resuming, the gpu process was left in a Running state but with the thread metadata not cleared. Then, in the next stop event, it would reuse the previous thread metadata instead of using the new one.
Not only that, this sometimes caused many event listeners that were processing the original stop event to not complete their job after the GPUAction issued the Resume request.

A simple way to fix this is to connect to the gpu process synchronously. This ensures that all the stop event handlers have been processed and that we can safely issue the Resume without race conditions.

A better fix would be to add a WaitForProcessToStop right before processing GPUActions, but I wasn't able to make it work that way. I'm too ignorant.

As preliminary context, everything belows applies to the client. The
server side is all fine.

For NVIDIA, after connecting to the GPU process we issue a resume_gpu
GPUAction. We were doing that after connecting to the process
asynchronously, which resulted in a race condition in which the
following actions were happening at the same time:

- the gpu process was handling its stop state, fetching the state of its
  threads
- the GPUAction was issuing a Resume request

this caused that very often, after resuming, the gpu process was left in
a Running state but with the thread metadata not cleared. Then, in the
next stop event, it would reuse the previous thread metadata instead of
using the new one.
Not only that, this sometimes caused many event listeners that were
processing the original stop event to not complete their job after the
GPUAction issued the Resume request.

A simple way to fix this is to connect to the gpu process synchronously.
This ensures that all the stop event handlers have been processed and
that we can safely issue the Resume without race conditions.

A better fix would be to add a WaitForProcessToStop right before
processing GPUActions, but I wasn't able to make it work that way. I'm
too ignorant.
@walter-erquinigo walter-erquinigo changed the title [To upstream][LLDB] Connect synchronously to the gpu process [LLDB][GPU] Connect synchronously to the gpu process Jul 1, 2025
@walter-erquinigo walter-erquinigo merged commit 668cdc8 into llvm-server-plugins Jul 1, 2025
6 checks passed
@walter-erquinigo walter-erquinigo deleted the clayborg/walter/sync-connection branch July 1, 2025 21:35
pveras pushed a commit to pveras/llvm-project that referenced this pull request Oct 10, 2025
A recent change adding a new sanitizer kind (via Sanitizers.def) was
reverted in c74fa20 ("Revert "[Clang][CodeGen] Introduce the
AllocToken SanitizerKind" (llvm#162413)"). The reason was this ASan report,
when running the test cases in
clang/test/Preprocessor/print-header-json.c:

```
==clang==483265==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x7d82b97e8b58 at pc 0x562cd432231f bp 0x7fff3fad0850 sp 0x7fff3fad0848
READ of size 16 at 0x7d82b97e8b58 thread T0
    #0 0x562cd432231e in __copy_non_overlapping_range<const unsigned long *, const unsigned long *> zorg-test/libcxx_install_asan_ubsan/include/c++/v1/string:2144:38
    clayborg#1 0x562cd432231e in void std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>::__init_with_size[abi:nn220000]<unsigned long const*, unsigned long const*>(unsigned long const*, unsigned long const*, unsigned long) zorg-test/libcxx_install_asan_ubsan/include/c++/v1/string:2685:18
    clayborg#2 0x562cd41e2797 in __init<const unsigned long *, 0> zorg-test/libcxx_install_asan_ubsan/include/c++/v1/string:2673:3
    clayborg#3 0x562cd41e2797 in basic_string<const unsigned long *, 0> zorg-test/libcxx_install_asan_ubsan/include/c++/v1/string:1174:5
    clayborg#4 0x562cd41e2797 in clang::ASTReader::ReadString(llvm::SmallVectorImpl<unsigned long> const&, unsigned int&) clang/lib/Serialization/ASTReader.cpp:10171:15
    clayborg#5 0x562cd41fd89a in clang::ASTReader::ParseLanguageOptions(llvm::SmallVector<unsigned long, 64u> const&, llvm::StringRef, bool, clang::ASTReaderListener&, bool) clang/lib/Serialization/ASTReader.cpp:6475:28
    clayborg#6 0x562cd41eea53 in clang::ASTReader::ReadOptionsBlock(llvm::BitstreamCursor&, llvm::StringRef, unsigned int, bool, clang::ASTReaderListener&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>&) clang/lib/Serialization/ASTReader.cpp:3069:11
    clayborg#7 0x562cd4204ab8 in clang::ASTReader::ReadControlBlock(clang::serialization::ModuleFile&, llvm::SmallVectorImpl<clang::ASTReader::ImportedModule>&, clang::serialization::ModuleFile const*, unsigned int) clang/lib/Serialization/ASTReader.cpp:3249:15
    clayborg#8 0x562cd42097d2 in clang::ASTReader::ReadASTCore(llvm::StringRef, clang::serialization::ModuleKind, clang::SourceLocation, clang::serialization::ModuleFile*, llvm::SmallVectorImpl<clang::ASTReader::ImportedModule>&, long, long, clang::ASTFileSignature, unsigned int) clang/lib/Serialization/ASTReader.cpp:5182:15
    clayborg#9 0x562cd421ec77 in clang::ASTReader::ReadAST(llvm::StringRef, clang::serialization::ModuleKind, clang::SourceLocation, unsigned int, clang::serialization::ModuleFile**) clang/lib/Serialization/ASTReader.cpp:4828:11
    clayborg#10 0x562cd3d07b74 in clang::CompilerInstance::findOrCompileModuleAndReadAST(llvm::StringRef, clang::SourceLocation, clang::SourceLocation, bool) clang/lib/Frontend/CompilerInstance.cpp:1805:27
    clayborg#11 0x562cd3d0b2ef in clang::CompilerInstance::loadModule(clang::SourceLocation, llvm::ArrayRef<clang::IdentifierLoc>, clang::Module::NameVisibilityKind, bool) clang/lib/Frontend/CompilerInstance.cpp:1956:31
    clayborg#12 0x562cdb04eb1c in clang::Preprocessor::HandleHeaderIncludeOrImport(clang::SourceLocation, clang::Token&, clang::Token&, clang::SourceLocation, clang::detail::SearchDirIteratorImpl<true>, clang::FileEntry const*) clang/lib/Lex/PPDirectives.cpp:2423:49
    clayborg#13 0x562cdb042222 in clang::Preprocessor::HandleIncludeDirective(clang::SourceLocation, clang::Token&, clang::detail::SearchDirIteratorImpl<true>, clang::FileEntry const*) clang/lib/Lex/PPDirectives.cpp:2101:17
    clayborg#14 0x562cdb043366 in clang::Preprocessor::HandleDirective(clang::Token&) clang/lib/Lex/PPDirectives.cpp:1338:14
    clayborg#15 0x562cdafa84bc in clang::Lexer::LexTokenInternal(clang::Token&, bool) clang/lib/Lex/Lexer.cpp:4512:7
    clayborg#16 0x562cdaf9f20b in clang::Lexer::Lex(clang::Token&) clang/lib/Lex/Lexer.cpp:3729:24
    clayborg#17 0x562cdb0d4ffa in clang::Preprocessor::Lex(clang::Token&) clang/lib/Lex/Preprocessor.cpp:896:11
    clayborg#18 0x562cd77da950 in clang::ParseAST(clang::Sema&, bool, bool) clang/lib/Parse/ParseAST.cpp:163:7
    [...]

0x7d82b97e8b58 is located 0 bytes after 3288-byte region [0x7d82b97e7e80,0x7d82b97e8b58)
allocated by thread T0 here:
    #0 0x562cca76f604 in malloc zorg-test/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:67:3
    clayborg#1 0x562cd1cce452 in safe_malloc llvm/include/llvm/Support/MemAlloc.h:26:18
    clayborg#2 0x562cd1cce452 in llvm::SmallVectorBase<unsigned int>::grow_pod(void*, unsigned long, unsigned long) llvm/lib/Support/SmallVector.cpp:151:15
    clayborg#3 0x562cdbe1768b in grow_pod llvm/include/llvm/ADT/SmallVector.h:139:11
    clayborg#4 0x562cdbe1768b in grow llvm/include/llvm/ADT/SmallVector.h:525:41
    clayborg#5 0x562cdbe1768b in reserve llvm/include/llvm/ADT/SmallVector.h:665:13
    clayborg#6 0x562cdbe1768b in llvm::BitstreamCursor::readRecord(unsigned int, llvm::SmallVectorImpl<unsigned long>&, llvm::StringRef*) llvm/lib/Bitstream/Reader/BitstreamReader.cpp:230:10
    clayborg#7 0x562cd41ee8ab in clang::ASTReader::ReadOptionsBlock(llvm::BitstreamCursor&, llvm::StringRef, unsigned int, bool, clang::ASTReaderListener&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>&) clang/lib/Serialization/ASTReader.cpp:3060:49
    clayborg#8 0x562cd4204ab8 in clang::ASTReader::ReadControlBlock(clang::serialization::ModuleFile&, llvm::SmallVectorImpl<clang::ASTReader::ImportedModule>&, clang::serialization::ModuleFile const*, unsigned int) clang/lib/Serialization/ASTReader.cpp:3249:15
    clayborg#9 0x562cd42097d2 in clang::ASTReader::ReadASTCore(llvm::StringRef, clang::serialization::ModuleKind, clang::SourceLocation, clang::serialization::ModuleFile*, llvm::SmallVectorImpl<clang::ASTReader::ImportedModule>&, long, long, clang::ASTFileSignature, unsigned int) clang/lib/Serialization/ASTReader.cpp:5182:15
    clayborg#10 0x562cd421ec77 in clang::ASTReader::ReadAST(llvm::StringRef, clang::serialization::ModuleKind, clang::SourceLocation, unsigned int, clang::serialization::ModuleFile**) clang/lib/Serialization/ASTReader.cpp:4828:11
    clayborg#11 0x562cd3d07b74 in clang::CompilerInstance::findOrCompileModuleAndReadAST(llvm::StringRef, clang::SourceLocation, clang::SourceLocation, bool) clang/lib/Frontend/CompilerInstance.cpp:1805:27
    clayborg#12 0x562cd3d0b2ef in clang::CompilerInstance::loadModule(clang::SourceLocation, llvm::ArrayRef<clang::IdentifierLoc>, clang::Module::NameVisibilityKind, bool) clang/lib/Frontend/CompilerInstance.cpp:1956:31
    clayborg#13 0x562cdb04eb1c in clang::Preprocessor::HandleHeaderIncludeOrImport(clang::SourceLocation, clang::Token&, clang::Token&, clang::SourceLocation, clang::detail::SearchDirIteratorImpl<true>, clang::FileEntry const*) clang/lib/Lex/PPDirectives.cpp:2423:49
    clayborg#14 0x562cdb042222 in clang::Preprocessor::HandleIncludeDirective(clang::SourceLocation, clang::Token&, clang::detail::SearchDirIteratorImpl<true>, clang::FileEntry const*) clang/lib/Lex/PPDirectives.cpp:2101:17
    clayborg#15 0x562cdb043366 in clang::Preprocessor::HandleDirective(clang::Token&) clang/lib/Lex/PPDirectives.cpp:1338:14
    clayborg#16 0x562cdafa84bc in clang::Lexer::LexTokenInternal(clang::Token&, bool) clang/lib/Lex/Lexer.cpp:4512:7
    clayborg#17 0x562cdaf9f20b in clang::Lexer::Lex(clang::Token&) clang/lib/Lex/Lexer.cpp:3729:24
    clayborg#18 0x562cdb0d4ffa in clang::Preprocessor::Lex(clang::Token&) clang/lib/Lex/Preprocessor.cpp:896:11
    clayborg#19 0x562cd77da950 in clang::ParseAST(clang::Sema&, bool, bool) clang/lib/Parse/ParseAST.cpp:163:7
    [...]

SUMMARY: AddressSanitizer: heap-buffer-overflow clang/lib/Serialization/ASTReader.cpp:10171:15 in clang::ASTReader::ReadString(llvm::SmallVectorImpl<unsigned long> const&, unsigned int&)
```

The reason is this particular RUN line:
```
// RUN: env CC_PRINT_HEADERS_FORMAT=json CC_PRINT_HEADERS_FILTERING=direct-per-file CC_PRINT_HEADERS_FILE=%t.txt %clang -fsyntax-only -I %S/Inputs/print-header-json -isystem %S/Inputs/print-header-json/system -fmodules -fimplicit-module-maps -fmodules-cache-path=%t %s -o /dev/null
```

which was added in 8df194f ("[Clang] Support includes translated to
module imports in -header-include-filtering=direct-per-file (llvm#156756)").

The problem is caused by an incremental build reusing stale cached
module files (.pcm) that are no longer binary-compatible with the
updated compiler. Adding a new sanitizer option altered the implicit
binary layout of the serialized LangOptions data structure. The build +
test system is oblivious to such changes. When the new compiler
attempted to read the old module file (from the previous test
invocation), it misinterpreted the data due to the layout mismatch,
resulting in a heap-buffer-overflow. Unfortunately Clang's PCM format
does not encode nor detect version mismatches here; a more graceful
failure mode would be preferable.

For now, fix the test to be more robust with incremental build + test.
clayborg pushed a commit that referenced this pull request Nov 14, 2025
…am (llvm#167724)

This got exposed by `09262656f32ab3f2e1d82e5342ba37eecac52522`.

The underlying stream of `m_os` is referenced by the `TextDiagnostic`
member of `TextDiagnosticPrinter`. It got turned into a
`llvm::formatted_raw_ostream` in the commit above. When
`~TextDiagnosticPrinter` (and thus `~TextDiagnostic`) is invoked, we now
call `~formatted_raw_ostream`, which tries to access the underlying
stream. But `m_os` was already deleted because it is earlier in the
order of destruction in `TextDiagnosticPrinter`. Move the `m_os` member
before the `TextDiagnosticPrinter` to avoid a use-after-free.

Drive-by:
* Also move the `m_output` member which the `m_os` holds a reference to.
The fact it's a reference indicates the expectation is most likely that
the string outlives the stream.

The ASAN macOS bot is currently failing with this:
```
08:15:39  =================================================================
08:15:39  ==61103==ERROR: AddressSanitizer: heap-use-after-free on address 0x60600012cf40 at pc 0x00012140d304 bp 0x00016eecc850 sp 0x00016eecc848
08:15:39  READ of size 8 at 0x60600012cf40 thread T0
08:15:39      #0 0x00012140d300 in llvm::formatted_raw_ostream::releaseStream() FormattedStream.h:205
08:15:39      #1 0x00012140d3a4 in llvm::formatted_raw_ostream::~formatted_raw_ostream() FormattedStream.h:145
08:15:39      #2 0x00012604abf8 in clang::TextDiagnostic::~TextDiagnostic() TextDiagnostic.cpp:721
08:15:39      #3 0x00012605dc80 in clang::TextDiagnosticPrinter::~TextDiagnosticPrinter() TextDiagnosticPrinter.cpp:30
08:15:39      #4 0x00012605dd5c in clang::TextDiagnosticPrinter::~TextDiagnosticPrinter() TextDiagnosticPrinter.cpp:27
08:15:39      #5 0x0001231fb210 in (anonymous namespace)::StoringDiagnosticConsumer::~StoringDiagnosticConsumer() ClangModulesDeclVendor.cpp:47
08:15:39      #6 0x0001231fb3bc in (anonymous namespace)::StoringDiagnosticConsumer::~StoringDiagnosticConsumer() ClangModulesDeclVendor.cpp:47
08:15:39      #7 0x000129aa9d70 in clang::DiagnosticsEngine::~DiagnosticsEngine() Diagnostic.cpp:91
08:15:39      #8 0x0001230436b8 in llvm::RefCountedBase<clang::DiagnosticsEngine>::Release() const IntrusiveRefCntPtr.h:103
08:15:39      #9 0x0001231fe6c8 in (anonymous namespace)::ClangModulesDeclVendorImpl::~ClangModulesDeclVendorImpl() ClangModulesDeclVendor.cpp:93
08:15:39      #10 0x0001231fe858 in (anonymous namespace)::ClangModulesDeclVendorImpl::~ClangModulesDeclVendorImpl() ClangModulesDeclVendor.cpp:93
...
08:15:39
08:15:39  0x60600012cf40 is located 32 bytes inside of 56-byte region [0x60600012cf20,0x60600012cf58)
08:15:39  freed by thread T0 here:
08:15:39      #0 0x0001018abb88 in _ZdlPv+0x74 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x4bb88)
08:15:39      #1 0x0001231fb1c0 in (anonymous namespace)::StoringDiagnosticConsumer::~StoringDiagnosticConsumer() ClangModulesDeclVendor.cpp:47
08:15:39      #2 0x0001231fb3bc in (anonymous namespace)::StoringDiagnosticConsumer::~StoringDiagnosticConsumer() ClangModulesDeclVendor.cpp:47
08:15:39      #3 0x000129aa9d70 in clang::DiagnosticsEngine::~DiagnosticsEngine() Diagnostic.cpp:91
08:15:39      #4 0x0001230436b8 in llvm::RefCountedBase<clang::DiagnosticsEngine>::Release() const IntrusiveRefCntPtr.h:103
08:15:39      #5 0x0001231fe6c8 in (anonymous namespace)::ClangModulesDeclVendorImpl::~ClangModulesDeclVendorImpl() ClangModulesDeclVendor.cpp:93
08:15:39      #6 0x0001231fe858 in (anonymous namespace)::ClangModulesDeclVendorImpl::~ClangModulesDeclVendorImpl() ClangModulesDeclVendor.cpp:93
...
08:15:39
08:15:39  previously allocated by thread T0 here:
08:15:39      #0 0x0001018ab760 in _Znwm+0x74 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x4b760)
08:15:39      #1 0x0001231f8dec in lldb_private::ClangModulesDeclVendor::Create(lldb_private::Target&) ClangModulesDeclVendor.cpp:732
08:15:39      #2 0x00012320af58 in lldb_private::ClangPersistentVariables::GetClangModulesDeclVendor() ClangPersistentVariables.cpp:124
08:15:39      #3 0x0001232111f0 in lldb_private::ClangUserExpression::PrepareForParsing(lldb_private::DiagnosticManager&, lldb_private::ExecutionContext&, bool) ClangUserExpression.cpp:536
08:15:39      #4 0x000123213790 in lldb_private::ClangUserExpression::Parse(lldb_private::DiagnosticManager&, lldb_private::ExecutionContext&, lldb_private::ExecutionPolicy, bool, bool) ClangUserExpression.cpp:647
08:15:39      #5 0x00012032b258 in lldb_private::UserExpression::Evaluate(lldb_private::ExecutionContext&, lldb_private::EvaluateExpressionOptions const&, llvm::StringRef, llvm::StringRef, std::__1::shared_ptr<lldb_private::ValueObject>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, lldb_private::ValueObject*) UserExpression.cpp:280
08:15:39      #6 0x000120724010 in lldb_private::Target::EvaluateExpression(llvm::StringRef, lldb_private::ExecutionContextScope*, std::__1::shared_ptr<lldb_private::ValueObject>&, lldb_private::EvaluateExpressionOptions const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, lldb_private::ValueObject*) Target.cpp:2905
08:15:39      #7 0x00011fc7bde0 in lldb::SBTarget::EvaluateExpression(char const*, lldb::SBExpressionOptions const&) SBTarget.cpp:2305
08:15:39  ==61103==ABORTING
...
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants