Modules support #1293

Open
sam-mccall opened this issue Sep 15, 2022 · 28 comments
Labels
enhancement New feature or request

Comments

@sam-mccall
Member

sam-mccall commented Sep 15, 2022

It would be nice if some of {C++20 modules, clang header modules} worked in clangd.

Essentially no configuration of this works today: occasionally things may happen to work, but there are many reported problems and crashes. These are not just bugs that can be fixed; a design and new infrastructure are needed.

Because clang's module functionality can be enabled by setting driver flags, it is easy for people to end up in this broken and unsupportable state and to waste (their + our) time debugging it. We should consider failing early and explicitly instead.

AFAIK nobody has plans/availability to work on this soon.

Some issues that need to be addressed:

  • which subset of clang's many module features/flags are in scope?
  • explicit modules provided by the build system: we can't assume these are compatible (they may come from a different compiler version), so we may need to convert to implicitly building the modules we need
  • having AST workers/background threads spawning module builds independently and coordinating through the module cache is probably not good; we'll want some kind of worker pool and coordination
  • we presumably want to parse modules in the same lite mode as preambles (e.g. skip-function-bodies)
  • gcc modules flags are different from clang modules flags (and MSVC presumably different again), we need to parse them
  • how do modules fit into the dynamic index? what does a preamble with modules in it look like?
  • scattered places where we assume mainfile/preamble dichotomy, reason about #include structure, etc
  • actual handling of import/export AST nodes
@HighCommander4

Some other open bugs we have on file related to modules are #738, #797, #1105.

@Trass3r

Trass3r commented Nov 19, 2022

This includes -fmodules? What is the current state?
With an LLVM workspace configured with LLVM_ENABLE_LIBCXX=ON LLVM_ENABLE_MODULES=ON, clangd doesn't crash or anything, but it produces lots of squiggles while the build itself runs through just fine.
Shouldn't clangd rather ignore those arguments if it isn't able to handle them properly? (-fmodules -fmodules-cache-path=... -Xclang -fmodules-local-submodule-visibility -gmodules)

@mathstuf

mathstuf commented Dec 1, 2022

(Note: speaking as a CMake developer here with a focus on how modules work with CMake; other build systems have similar challenges, but not necessarily the same ones)

gcc modules flags are different than clang modules flags (and MSVC presumably different again), we need to parse them

Note that gcc's module flags may indicate an active participant if -fmodule-mapper= points to a socket or process execution. That may not be emulatable (I believe build2 does this, but CMake uses static files written out during the build).

I suspect that for static analysis, a lot more communication between build systems and tooling is needed to know what is going on. The problem is that with explicit modules even compile_commands.json is not useful until a build happens because:

  • for gcc, the -fmodule-mapper= argument is generated at build
  • for clang and msvc, modules are communicated via @rspfile arguments that don't exist until the build is performed

This isn't even counting the fact that the information presented there isn't directly usable, as any referenced BMI files will be in the host compiler's format, not something the tool necessarily understands.

@MK-Alias

MK-Alias commented Dec 2, 2022

In the age we live in, a programming language is only as good as its runtimes/frameworks and, lastly, its editor support... I mean, who codes these days with just syntax highlighting? C++ modules are - from a user perspective - a very nice addition to the C++ syntax and really modernize it. However, the feature is really being held back by editor support.

AFAIK: C++ modules are only supported by Visual Studio (using MSVC ixx files) and CLion (using clang cppm files). If clangd supported them, it would significantly improve editor support and, with that, their usability/popularity.

As for compilers having their own flags/options/file-extensions for modules - which is regrettable - just ignore all that and start by supporting clang's own formats.

Although this is - currently - not a critical functionality, I personally am hoping to see modules supported by clangd soon!

@sdykae

sdykae commented Dec 29, 2022

from https://gitlab.kitware.com/cmake/cmake/-/issues/18355
https://github.com/mathstuf/llvm-project/tree/p1689r5-csd
Waiting for a stable LLVM PR, and clangd support, to start developing with a VSCode extension.

@HighCommander4

It may be somewhat instructive to look at CLion's approach to supporting modules, which first appeared in the latest 2022.3 release: https://blog.jetbrains.com/clion/2022/11/clion-2022-3-released/#cpp_modules

It's somewhat light on details but this part seems relevant:

CLion collects module information from all .ixx, .cppm, and .mxx files in the project and doesn’t rely on the specific approach you use to build modules. Whether it’s CMake with the Visual Studio C++ toolchain or CMake with Clang and specific compilation flags, CLion provides a similar experience

@ChuanqiXu9

ChuanqiXu9 commented May 31, 2023

CLion collects module information from all .ixx, .cppm, and .mxx files in the project and doesn’t rely on the specific approach you use to build modules.

This is not strictly correct, since it is allowed to declare a module within a .cpp file. But it should be acceptable as a first step.

It may be somewhat instructive to look at CLion's approach to supporting modules, which first appeared in the latest 2022.3 release: https://blog.jetbrains.com/clion/2022/11/clion-2022-3-released/#cpp_modules

Yeah. It doesn't mention many technical details, though, so I guess we can't learn a lot from this blog. And it only touches on the second point given by @sam-mccall, about implicitly building modules.


I guess the first step doesn't have to be too ambitious. I mean, maybe it is acceptable to ignore some requirements in the first step, including:

  • Compatibility with GCC and MSVC.
  • Indexing things.
  • lite mode things (e.g. skip-function-bodies)
  • #include structures

And what would be good as the first step? I think it may be good if:

  • We can collect the potential module files by clang-scan-deps in the project tree (a possible invocation is sketched after this list).
  • We can build these module files implicitly in order.
  • We can get the AST we just built.
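
For reference, something like the following clang-scan-deps invocation should give us the provides/requires information per TU. This is just to illustrate the kind of output we would consume; exact flag spellings may vary between clang versions:

clang-scan-deps -format=p1689 -compilation-database=compile_commands.json

The P1689-style output maps each TU to the module it provides and the modules it requires, which is exactly what we need to order the implicit builds.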

What do you think about this? @HighCommander4

@RedBeard0531

We can collect the potential module files (by its suffixes) in the project tree.

Please don't use extensions to decide which files are modules. It is perfectly valid to have a .cpp file be a module interface. Please just use a scanner. A lot of work went into the standard wording to ensure that it is possible to very efficiently scan the whole codebase. I know some effort was started in clang-scan-deps for this. It would be great if clangd could reuse and possibly improve this code when doing its scan.

@ChuanqiXu9

We can collect the potential module files (by its suffixes) in the project tree.

Please don't use extensions to decide which files are modules. It is perfectly valid to have a .cpp file be a module interface. Please just use a scanner. A lot of work went into the standard wording to ensure that it is possible to very efficiently scan the whole codebase. I know some effort was started in clang-scan-deps for this. It would be great if clangd could reuse and possibly improve this code when doing its scan.

Yeah, that is better and clearer. I've edited the post.

@ChuanqiXu9

After taking some time to look at the code, I have some lower-level ideas: it looks possible to support C++20 modules in BackgroundIndex.

  1. We need to introduce a new class to describe the dependency information between named modules. It looks OK to make the class visible only to BackgroundIndex. We can use clang-scan-deps to initialize the new class.
  2. Then in BackgroundIndex::changedFilesTask, if the file is a module interface unit, we try to generate the module interface into a special cache maintained by clangd before we actually index the file. This requires clangd to rewrite the command line to emit the BMI somewhere else. It is also helpful if we want to emit a different kind of BMI someday.
  3. When we index source files, we need to rewrite the command line to look up the BMIs we generated in step 2. This is necessary to avoid breaking the normal build. (A rough sketch of this rewriting follows below.)
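
To make the idea concrete, here is a rough sketch of the data and the command-line rewriting I have in mind. All names here (ModuleDependencyInfo, rewriteCommandForClangdCache, ...) are hypothetical, not existing clangd APIs:

#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical: which source file provides each named module, and which
// modules each source file imports (filled in from clang-scan-deps output).
struct ModuleDependencyInfo {
  std::unordered_map<std::string, std::string> ProviderOf;            // module name -> interface unit
  std::unordered_map<std::string, std::vector<std::string>> Imports;  // source file -> imported modules
};

// Hypothetical: point the compile command at BMIs in a clangd-private cache
// instead of the build directory, so we never touch or depend on the real build.
void rewriteCommandForClangdCache(std::vector<std::string> &Args,
                                  const std::string &CacheDir,
                                  const ModuleDependencyInfo &Deps,
                                  const std::string &SourceFile) {
  auto It = Deps.Imports.find(SourceFile);
  if (It == Deps.Imports.end())
    return;
  for (const std::string &Name : It->second)
    Args.push_back("-fmodule-file=" + Name + "=" + CacheDir + "/" + Name + ".pcm");
}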

In my mind this looks workable. It is not locked to a compiler version, so it can also work for projects built with GCC and MSVC. There are also some problems on the compiler side, e.g. the size of the BMI generated by clang is quite big now, which might strain the cache. But I feel it is OK as a first step.

There are also some higher-level issues:
(1) When the clangd server initializes, will BackgroundIndex::changedFilesTask get called? From the name it looks like it won't be, but I can't find other APIs.
(2) The idea above requires us to always scan all sources up front, which sounds a little scary since it just wastes time for non-module projects. The only mitigation I have now is to add an option, disabled by default, which could be enabled by default someday if modules become popular.

@HighCommander4

And what would be good as the first step? I think it may be good if:

  • We can collect the potential module files by clang-scan-deps in the project tree.

  • We can build these module files implicitly in order.

  • We can get the AST we just built.

What do you think about this? @HighCommander4

My perspective as a user is that yes, this level of support for modules would already be useful and valuable. I will defer to @sam-mccall for a maintainer's perspective of what would be accepted as a contribution. (I would imagine that it's fine to have various limitations especially if the support is behind an option/flag.)

A few thoughts/questions:

  • To make the performance acceptable, we'll want to persist BMIs rather than rebuilding them each time we encounter a module. (I think all previous comments assume this, I'm just making it explicit.)
  • To avoid stepping on the project's actual build (which may use a different compiler or a different version of clang), we should persist our BMIs in a clangd-specific place (such as .cache/clangd), like you said in the previous comment.
  • Is there an opportunity to reuse this persisted BMI cache between the indexer and the preambles of open files? When you open a source file that imports a module M, the BMI for M may or may not have already been built by the indexer. If it hasn't, I imagine the AST (preamble) build should build it on the spot (and then the indexer won't need to build it again when it gets around to indexing a source file that imports M), but if the indexer has already built M's BMI then I imagine the preamble build can just read it. (A sketch of one possible cache-path scheme follows this list.)
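
For illustration, a minimal sketch of how such a clangd-private BMI path could be derived so that both the indexer and preamble builds can find the same artifact. The names and the hashing scheme are placeholders, not a proposal for the real layout:

#include <cstddef>
#include <functional>
#include <string>

// Hypothetical: key the cached BMI on the module name plus whatever flags
// affect the BMI contents, and keep it under the project's .cache/clangd.
std::string bmiCachePath(const std::string &ProjectCacheDir, // e.g. <project>/.cache/clangd
                         const std::string &ModuleName,
                         const std::string &RelevantFlags) {
  std::size_t Key = std::hash<std::string>{}(ModuleName + "|" + RelevantFlags);
  return ProjectCacheDir + "/modules/" + ModuleName + "." + std::to_string(Key) + ".pcm";
}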

@HighCommander4

(1) When the clangd server initializes, will BackgroundIndex::changedFilesTask get called?

Yes, when the compilation database is loaded, it calls this function with all of the entries in the database, which will call changedFilesTask().

@ChuanqiXu9

Is there an opportunity to reuse this persisted BMI cache between the indexer and the preambles of open files?

I am not quite sure I've understood the idea about the indexer. My current thinking is to let the TUScheduler hold the ModulesDependencyGraph. When TUScheduler::update is called, we check whether the file has any module dependencies; if it does and the BMIs don't exist yet, we don't call ASTWorker::update immediately, but instead assign a task to an AsyncRunner to build the BMIs and call ASTWorker::update afterwards.
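
Roughly, the flow I'm imagining looks like the sketch below. The helper functions are hypothetical stand-ins for the pieces mentioned above (the dependency graph, the async runner, the normal ASTWorker path), not real clangd APIs:

#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins, declared only to make the sketch self-contained.
std::vector<std::string> missingBMIsFor(const std::string &File); // imported modules with no cached BMI yet
void buildBMI(const std::string &ModuleName);                     // emit the BMI into clangd's cache
void astWorkerUpdate(const std::string &File);                    // the normal parse/update path
void runAsync(std::function<void()> Task);                        // some worker pool

void onUpdate(const std::string &File) {
  std::vector<std::string> Missing = missingBMIsFor(File);
  if (Missing.empty()) {
    astWorkerUpdate(File);                  // all prerequisites ready: parse as usual
    return;
  }
  runAsync([File, Missing] {
    for (const std::string &Mod : Missing)  // build prerequisite BMIs first...
      buildBMI(Mod);
    astWorkerUpdate(File);                  // ...then let the normal update run
  });
}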

Now it is OK roughly for the a-b test case in llvm/llvm-project#58723.

I feel like there is only one TUScheduler per project (is this true?), so the BMIs won't be rebuilt every time we open a new file.

But I feel I am missing some points since I haven't touched the indexer at all. I initially wanted to work in BackgroundIndex, but the logs show it is not reached after I open a new project, even if I add -background-index. Or is it OK to not make the indexer know about modules?

@HighCommander4

HighCommander4 commented Jun 2, 2023

I feel like there is only one TUScheduler per project (is this true?)

I believe so.

So the BMIs won't be rebuilt every time we open a new file.

Note, I was also thinking of persisting the BMIs between clangd invocations.

is it OK to not make the indexer know about the modules?

The indexer also triggers AST builds (for every file in the project that isn't already indexed), that happens around here. That AST build will encounter imports, so I guess it should know about modules as well.

@ChuanqiXu9

Note, I was also thinking of persisting the BMIs between clangd invocations.

That makes sense. I feel like it is OK.

The indexer also triggers AST builds (for every file in the project that isn't already indexed), that happens around here. That AST build will encounter imports, so I guess it should know about modules as well.

My thought is that it works as long as we rewrite the command line to refer to the new BMIs we built.

@HighCommander4

The indexer also triggers AST builds (for every file in the project that isn't already indexed), that happens around here. That AST build will encounter imports, so I guess it should know about modules as well.

My thought is that it works as long as we rewrite the command line to refer to the new BMIs we built.

I think that would work as part of the solution. Another part of the solution would be to ensure that the indexing of source files only starts after the BMIs have been built.

Presumably, something should proactively drive the building of BMIs (not just when you open a source file that depends on it), so that the indexing of source files eventually gets unblocked. Maybe it makes sense to think of building the BMIs as the "first stage" of indexing, and indexing the source files as the "second stage"?
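
To illustrate the two-stage idea, here is a small self-contained sketch of ordering the first stage: build every module interface after the modules it imports, and only start indexing ordinary sources once this order has been processed. Purely illustrative, not clangd code:

#include <functional>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Input: module name -> names of modules it imports.
// Output: an order in which to build BMIs so dependencies always come first.
std::vector<std::string>
bmiBuildOrder(const std::unordered_map<std::string, std::vector<std::string>> &Imports) {
  std::vector<std::string> Order;
  std::unordered_set<std::string> Done, InProgress;
  std::function<void(const std::string &)> Visit = [&](const std::string &M) {
    if (Done.count(M) || InProgress.count(M))
      return;                          // already handled (or a cycle: ill-formed, skip)
    InProgress.insert(M);
    auto It = Imports.find(M);
    if (It != Imports.end())
      for (const std::string &Dep : It->second)
        Visit(Dep);
    InProgress.erase(M);
    Done.insert(M);
    Order.push_back(M);                // every import of M appears earlier in Order
  };
  for (const auto &Entry : Imports)
    Visit(Entry.first);
  return Order;
}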

@ChuanqiXu9

I think that would work as part of the solution. Another part of the solution would be to ensure that the indexing of source files only starts after the BMIs have been built.

Presumably, something should proactively drive the building of BMIs (not just when you open a source file that depends on it), so that the indexing of source files eventually gets unblocked. Maybe it makes sense to think of building the BMIs as the "first stage" of indexing, and indexing the source files as the "second stage"?

Do you mean indexing the source files only after we have built all the BMIs, or only after we finish the first stage (since it is possible that some BMIs are not buildable)? Sounds like a good idea; it would simplify the implementation.

@sam-mccall
Member Author

A few random thoughts...

(Full disclosure: I did some brainstorming about this a few months ago and couldn't come up with anything good, hopefully y'all are more creative & understand modules better!)

  • module building looks kinda like indexing, but I think focusing on background indexing is a red herring; preamble building/indexing is where we should begin. We need to be able to open a file and start using it without any significant fraction of the project having been indexed. This is the single most critical latency path, and to hit it today (building the preamble) we take shortcuts which are unacceptable for background indexing (e.g. skip-function-bodies).
  • the project may be so large that we cannot enumerate the modules, let alone touch every file. If we want to rely on clang-scan-deps for small projects, that makes sense, but we should abstract it somehow with a query interface that scales (like CompilationDatabase); one possible shape is sketched after this list
  • the big performance win with modules is the ability to reuse modules between invocations, e.g. share the overlapping parts of preambles rather than starting from scratch. However, editing headers will still force us to rebuild all dependent modules/preambles, as modules are tightly coupled to their dependencies. (If you built a.pcm#1 against b.pcm#5 you can't reuse it with b.pcm#6.)
  • we'll need some new abstractions, which are always hard to get right. TUScheduler is basically "full" in terms of complexity, we can't cram many more responsibilities in there without losing ability to reason about it. Obviously it will have to change, but the bulk of complexity should live somewhere else.
  • "modules" can be enabled in (too) many ways, and ultimately the design needs to accommodate as many as possible (there's no room for several implementations!). But we should be clear where we're starting (e.g. only fully-explicit command lines with -fmodule-file=foo=bar using c++20 syntax?) and roughly how other variants will fit in (can we translate implicit to explicit using clang-scan-deps? Do module maps make any difference?)

@RedBeard0531

RedBeard0531 commented Jun 2, 2023

(note: I know nothing about clangd's internals so this comment is based on guesses based on observable behavior. Hopefully it isn't complete nonsense and if needed someone can translate into the correct terminology)

I assume clangd must already have an implementation of a proto-buildsystem internally. For example, it can schedule indexing tasks with a controlled amount of parallelism and trigger reindexing of downstream TUs when a header file changes. Ideally it also notices when files change outside the editor (e.g. git) or when the editor is closed, and is able to do incremental reindexing where needed (although I've experienced issues around this in the past, less so recently).

I think part of the solution is to double down on having an integrated build system. Either continue improving the integrated one, or switch to something like llbuild. The key thing that I don't think was required before is task dependencies. While you did have source/header file -> TU indexer job dependencies, they merely triggered rerunning the task. Now you will need some indexer tasks to have an input dependency on the BMI which will be generated by another task. And it is important to rerun the downstream indexers whenever an upstream BMI is meaningfully changed (defining "meaningfully" can be tricky; a first pass can just treat any content difference as meaningful, but I suspect that can be improved on).
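
To make the task-dependency idea a bit more concrete, here is a toy sketch; all types are invented for illustration. Each task lists the BMIs it consumes and the one it produces, and a downstream task reruns only when the content hash of its inputs actually changes:

#include <cstddef>
#include <string>
#include <vector>

// Toy model of a node in the task DAG described above.
struct Task {
  std::vector<std::string> InputBMIs; // BMIs this task consumes
  std::string OutputBMI;              // BMI this task produces (empty for plain TUs)
  std::size_t LastInputHash = 0;      // combined hash of inputs at the last run
};

// "Meaningful change", first pass: any content difference counts.
// A smarter version could ignore changes that don't affect the exported interface.
bool needsRerun(const Task &T, std::size_t CurrentInputHash) {
  return CurrentInputHash != T.LastInputHash;
}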

Additionally, you cannot begin doing (much of) anything prior to scanning to determine module names and import requirements. And of course, when a file changes, it must be rescanned because it may have added an import, or changed which module/partition it declares itself to be part of. Files can also go from not generating a BMI to generating one, e.g. if you add an export module foo; line or change module blah; to module blah:blah;

I don't know how easy or hard it would be to fit the concept of task dependencies into the existing proto-buildsystem, but at least from an outsider's perspective, that seems like the best (if not the only) option. Or maybe I've just spent too much time working on buildsystems, so they feel like the solution to every problem now... 😄

@sam-mccall sam-mccall reopened this Jun 2, 2023
@sam-mccall
Member Author

(sorry, hit the wrong button)

Is there an opportunity to reuse this persisted BMI cache between the indexer and the preambles of open files?

The "lite mode" ASTs (skip-function-bodies etc) may not be suitable for indexing, and vice versa (even opportunistically: full ASTs are much larger => slower to load). I believe we can't mix the two due to diamond dependency problems. They may also have conflicting requirements for how to deal with staleness. If these can be solved, it would be fantastic.

I assume clangd must already have an implementation of a proto-buildsystem internally. For example, it can schedule indexing tasks

A very simple one for background indexing (no dependencies, as you note). However very deliberately no core functionality depends on this: the index is entirely optional and all features work (though degraded) if it's stale.

If we're going to have a build system we block on (and modules ~requires it) then it needs to be something new. It's going to be responsible for the critical path to user-facing latency, as opposed to background index's "process an infinite amount of non-critical batch work".

@ChuanqiXu9

module building looks kinda like indexing

I am not sure what you mean by indexing. Abstractly, do you mean a map from source files to BMIs? If so, I think we're on the same page.

We need to be able to open a file and start using it without any significant fraction of the project having been indexed. This is the single most critical latency path and to hit it today (building preamble) we take shortcuts which are unacceptable for background indexing (e.g. skip-function-bodies).

the project may be so large that we cannot enumerate the modules, let alone touch every file. If we want to rely on clang-scan-deps for small projects, that makes sense, but we should abstract it somehow with a query interface that scales (like CompilationDatabase)

Yeah, this is the thing I want to talk about. In my local experiments, I find it is hard/impossible to make modules work well in clangd if we don't pre-scan the whole project initially. Currently, in my experiments, I only do on-demand module building and parsing. That may be OK for non-modules code since the TUs don't depend on each other.

But this is problematic with modules. For the simple A-B example,

// b.cppm
export module b;
export int bbbb = 44;

// a.cppm
export module a;
import b;
export int a() {
    return bbbb;
}

Currently, my clangd works if I open both b.cppm and a.cppm. But if I clear the caches, close the two tabs, and restart clangd, then when I open a.cppm I see the annoying red squiggle again. The red squiggle only disappears after I open b.cppm. And this could be really bad in a big project, since nobody is going to open every file before starting work.

Since I currently only scan on demand, when I see import b; I don't know where module b actually comes from.

And I can imagine another bad case. We'd like to rewrite -fmodule-file=<module-name>=<path-to-BMI> arguments, but without pre-scanning the whole project we can't know whether <module-name> is a third-party module, so we can't decide whether to replace it or not.


So now I feel it is impossible to support modules well without scanning the whole project initially. I also agree that sounds bad. Any ideas?


we'll need some new abstractions, which are always hard to get right. TUScheduler is basically "full" in terms of complexity, we can't cram many more responsibilities in there without losing ability to reason about it. Obviously it will have to change, but the bulk of complexity should live somewhere else.

Agreed. But let's talk about this after we have a real demo.

"modules" can be enabled in (too) many ways, and ultimately the design needs to accommodate as many as possible (there's no room for several implementations!). But we should be clear where we're starting (e.g. only fully-explicit command lines with -fmodule-file=foo=bar using c++20 syntax?) and roughly how other variants will fit in (can we translate implicit to explicit using clang-scan-deps? Do module maps make any difference?)

We don't have so many ways to enable C++20 modules. Maybe you have that impression due to the existence of clang modules. For example, we won't have implicit dependencies within C++20 modules, and we won't have a module map.
BTW, ideally the diamond dependency won't be a problem, since we will require consistent build environments.

@sam-mccall
Member Author

sam-mccall commented Jun 2, 2023

I am not sure what you mean by indexing. Abstractly, do you mean a map from source files to BMIs? If so, I think we're on the same page.

Right - we have to parse the source code and process the results, producing respectively index info or BMI, and we want to do so for roughly the same set of sources, triggered by the same events (when a file is opened: all transitively imported things. In the background: all things in the project).

Yeah, this is the thing I want to talk about. In my local experiments, I find it is hard/impossible to make modules work well in clangd if we don't pre-scan the whole project initially

Agree, neither C++20 modules nor clang header modules provide a scalable way to discover module structure. This is a pretty major flaw we need to design around. Though it's basically the same difficulty as resolving #includes, and admits some of the same solutions:

  • we can scan the whole project lexically
  • we can have the build system provide an artifact like "modules.json", generated at e.g. cmake time
  • we can query the build system at runtime (e.g. bazel knows module boundaries)

It's fine to fall back to #1, but we need a system that can be extended to #2 or #3.

And I can imagine another bad case. We'd like to rewrite -fmodule-file=<module-name>=<path-to-BMI> arguments, but without pre-scanning the whole project we can't know whether <module-name> is a third-party module, so we can't decide whether to replace it or not.

I'm not sure exactly what you mean by "third-party module", but I don't see a way we can safely consume modules that clangd itself didn't build. (Version skew, among other things.) So I think we'll need to be at least rewriting all module flags, but ideally programmatically replacing all the module loading/caching/resolving interfaces.

We don't have so many ways to enable C++20 modules. Maybe you have that impression due to the existence of clang modules.

Are you suggesting we only support C++20 named modules? Adding the complexity needed for modules in general but leaving out clang modules seems like a tough sell.
(FWIW my understanding is both Google and Meta are interested in having modules work in clangd and contributing to this, both are using clang modules rather than C++20 modules)

Also, I believe clang modules are at least partly covered by C++20 as "header units". Whether or not we choose not to call them "modules", it seems like we should support codebases that use them.

For example, we won't have implicit dependencies within C++20 modules.

OK, my understanding is a bit hazy. Which of the following does clang's impl of C++20 modules require/forbid?

  • all imported modules have been built in a separate invocation, before the importing code is built
  • paths to all the imported modules are provided explicitly (rather than being probed for in some cache dir)
  • names of the imported modules are provided explicitly (rather than being retrieved from the PCM file)

BTW, ideally, the diamond dependency won't be a problem since we will require the consistent build environments.

I mentioned the diamond dependency problem specifically because we currently have two build environments (skip-function-bodies in preamble vs not when background-indexing). I agree that if we keep our build configurations isolated and are content to rebuild all dependent modules when a source file changes, we won't have a diamond dependency problem. (But each of these cost performance)

@RedBeard0531

RedBeard0531 commented Jun 2, 2023

the project may be so large that we cannot enumerate the modules, let alone touch every file.

So now I feel it is impossible to support modules well without scanning the whole project initially. I also agree that sounds bad. Any ideas?

I'm not sure how bad it really is. A lot of work went into putting restrictions into the C++20 specification to ensure that it will be possible to write a very fast scanner. It should be possible to scan at or near disk speeds on a single core (maybe not as fast as a high-end NVMe, but certainly several hundred MB/s per core should be achievable). So I think this should scale up to any meaningful definition of "project". This still won't scale up to a 1TB mega-monorepo where you never check out the whole repo at once, and certainly never build it all locally. So you will need an alternative for those (luckily most of them already are willing to invest in custom tooling), but scanning all project files should still be fine for the vast majority of clangd users. I think (admittedly, without evidence yet) a reasonable rule of thumb would be that anything small enough to have a single compile_commands.json file (which implies parsing a single JSON file with the full command line for every TU) should be able to scan all files within it. @Bigcheese might be able to comment more authoritatively on the performance of clang-scan-deps specifically.

That said, you don't have to eagerly do a full scan on everything. You can lazily determine the import DAG while indexing or loading the AST of the file being edited. The only thing you really need to know upfront is the mapping from module/partition name -> source file. While not 100% spec compliant[1], it is probably acceptable to do a "preflight scan" and search all files with the regex \bmodule\b and only do a full scan on them since they are the only ones that can define named modules (it is non-obvious, but this grammar rule means that module must appear in the source file itself, not only via an #include). Hopefully, in the (far) future the vast majority of C++ files will have module in them so this may not make sense forever, but since that is far from true today, you can save some time this way. Megaprojects, and possibly whatever the equivalent of /usr/include is for modules (still TBD), will need some sort of metadata (static files or server) to find the source for each named module.

[1] Ideally line splicing wouldn't be legal inside of a keyword or directive. But I think anyone who splits module across physical lines outside of a compiler stress test deserves what they get...
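
A minimal sketch of that preflight filter, for illustration only: cheaply reject files that cannot possibly declare a named module before handing the survivors to a real scanner such as clang-scan-deps. A production filter would also want to ignore comments and string literals:

#include <fstream>
#include <iterator>
#include <regex>
#include <string>

// Cheap pre-filter: only files containing the token "module" can possibly
// declare a named module, so only they need a full dependency scan.
bool mightDeclareNamedModule(const std::string &Path) {
  std::ifstream In(Path);
  std::string Contents((std::istreambuf_iterator<char>(In)),
                       std::istreambuf_iterator<char>());
  static const std::regex ModuleKeyword(R"(\bmodule\b)");
  return std::regex_search(Contents, ModuleKeyword);
}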

If we're going to have a build system we block on (and modules ~requires it) then it needs to be something new. It's going to be responsible for the critical path to user-facing latency, as opposed to background index's "process an infinite amount of non-critical batch work".

I think "~requires" is a good way to put it. The standard is very carefully worded to avoid placing restrictions on the implementation details. So a build system isn't technically required, and neither are BMIs. In theory, an implementation could parse each TU by simply starting at the top of the file and handling import similarly to #include by simply reading the source file directly, but in a special context to preserve the modular semantics. I don't know if clang supports such a mode, but I assume that the building blocks would be very useful since they would enable the equivalent of single-file -E -frewrite-includes roll-ups of TUs for bug reports and testing. Such a mode (if it exists) may be the shortest path to getting modules working in clangd, since it will at least "work" for both indexing and the open file AST, even if it is slower than it should be.

But having a build system (shorthand for any DAG-walking parallel task runner and incremental re-runner, not restricted to the traditional concept of a build system) and building some sort of BMI (a reusable cache of a module (unit)'s interface) is definitely what the design of C++ modules is optimized for. Hopefully it will result in significantly faster (re)indexing, and open file loading/parsing. Once you have that in place, about the only time I could imagine parsing modules directly from source being useful is when you cold-open a project and want to minimize the latency to first parse of a single open file. But even then, I'm not sure it will actually have lower latency than telling the build system to pause what it is doing and put all resources into building the BMIs for the transitive deps of the current file. Consider that it will be much simpler to parallelize the parsing of independent nodes in the DAG if they are separate tasks, so given enough cores (and core counts are more likely to go up than down over time!) your latency will be proportional to the depth of the DAG from your source file rather than its total size. And this becomes even more of a win if there are multiple open files (eg reopening a project with the same editor views as last time, or just restarting clangd) since they can share the benefit of parsing common deps only once.

full ASTs are much larger => slower to load

I'm not a clang dev, but from hallway conversations my understanding is that all compilers are planning to have as close as possible to an O(1) loading cost for BMIs. So the cost of adding an import to a file should be as close to zero as possible, and you will only have to pay for loading the parts of the BMI that you actually use, even if the BMI is huge. I think this will involve both carefully designed on-disk data structures and using mmap to load the BMI and use those data structures in-place. So I wouldn't worry too much about this, at least not yet. Of course I could see a concern about the cost to generate the full vs partial BMI contributing to file-open latency, but that is a separate issue. While it may make sense to generate a partial BMI if a full one isn't available yet, it probably isn't worth generating a partial one to save loading time when a full BMI is ready to use. Either way, none of that is on the critical engineering path to getting minimally working module support into clangd, so it may be better to start with only full BMIs and add partials as a possible optimization later. I assume anyone using modules would prefer to get it working sooner, even if it is slower than ideal.

@ChuanqiXu9

  • we can scan the whole project lexically
  • we can have the build system provide an artifact like "modules.json", generated at e.g. cmake time
  • we can query the build system at runtime (e.g. bazel knows module boundaries)

It's fine to fall back to #1, but we need a system that can be extended to #2 or #3.

I feel like the second option is good (bazel knows nothing about C++20 named modules right now). And it should be fine to implement the first one initially.

I'm not sure exactly what you mean by "third-party module", but I don't see a way we can safely consume modules that clangd itself didn't build. (Version skew, among other things.) So I think we'll need to be at least rewriting all module flags, but ideally programmatically replacing all the module loading/caching/resolving interfaces.

I mean modules whose source code is not in the project tree. For example, we'll have import std; in LLVM 17 (in planning), and clearly the source code of the std module won't be part of the project. How can we find the source code, especially for other third-party modules?

but I don't see a way we can safely consume modules that clangd itself didn't build. (Version skew, among other things).

My thought is that clangd won't need to consume such modules; it just falls back to the compiler to consume them, and it's the users' responsibility to make sure those modules are usable. I am not opposed to having clangd build modules itself as much as possible; I just feel it is not easy to find the corresponding source code out of tree.

Are you suggesting we only support C++20 named modules? Adding the complexity needed for modules in general but leaving out clang modules seems like a tough sell.
(FWIW my understanding is both Google and Meta are interested in having modules work in clangd and contributing to this, both are using clang modules rather than C++20 modules)

C++20 named modules are significantly different from clang header modules and header units, since a named module is a TU by itself, while clang header modules and header units keep header semantics. So I imagine it may be OK for clangd to treat the latter as headers. Or at least I think it is OK to consider C++20 named modules first, since they should be the most complex case.

OK, my understanding is a bit hazy. Which of the following does clang's impl of C++20 modules require/forbid?

  • all imported modules have been built in a separate invocation, before the importing code is built
  • paths to all the imported modules are provided explicitly (rather than being probed for in some cache dir)
  • names of the imported modules are provided explicitly (rather than being retrieved from the PCM file)

All three are required by C++20 named modules.
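
For the a/b example above, the fully explicit style looks roughly like this (exact driver spellings may vary slightly between clang versions): the importing TU is compiled only after b's BMI exists, and both the path and the name of the imported module are passed explicitly:

clang++ -std=c++20 --precompile b.cppm -o b.pcm
clang++ -std=c++20 --precompile -fmodule-file=b=b.pcm a.cppm -o a.pcm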

I'm not sure how bad it really is. A lot of work went into putting restrictions into the C++20 specification to ensure that it will be possible to write a very fast scanner.

Yeah, now we have consensus to scan the whole project first.

it is probably acceptable to do a "preflight scan" and search all files with the regex \bmodule\b and only do a full scan on them since they are the only ones that can define named modules

I am not sure it will really be faster. Let's revisit this once we find that full scanning is a bottleneck.

I don't know if clang supports such a mode, but I assume that the building blocks would be very useful since they would enable the equivalent of single-file -E -frewrite-includes roll-ups of TUs for bug reports and testing. Such a mode (if it exists) may be the shortest path to getting modules working in clangd, since it will at least "work" for both indexing and the open file AST, even if it is slower than it should be.

Clang doesn't support such a mode now. It looks like it would take some extra work to implement, especially to preserve the module semantics. In my mind it feels like the longer path.

but from hallway conversations my understanding is that all compilers are planning to have as close as possible to an O(1) loading cost for BMIs.

Currently, it is not strictly O(1) in clang, and I guess we will never fully achieve that. But the cost is really low in my experience.

so it may be better to start with only full BMIs and add partials as a possible optimization later. I assume anyone using modules would prefer to get it working sooner, even if it is slower than ideal.

Strongly agreed : )

@mathstuf

mathstuf commented Jun 2, 2023

[1] Ideally line splicing wouldn't be legal inside of a keyword or directive. But I think anyone who splits module across physical lines outside of a compiler stress test deserves what they get...

IIRC, we decided that line splicing is not allowed on module lines (import or export). I don't know where that wording is in the standard though :/ .

Also, I believe clang modules are at least partly covered by C++20 as "header units". Whether or not we choose not to call them "modules", it seems like we should support codebases that use them.

All known (to me) clang module deployments assume a highly consistent build environment. General C++ cannot assume that and must handle the "build a BMI per usage site" worst-case (collapsing compatible ones down is possible, but non-trivial to detect).

we can have the build system provide an artifact like "modules.json", generated at e.g. cmake time

The "at cmake time" part is highly unlikely to be possible (in general). The @modulemap files in the compile_commands.json and any analogous modules.json files are build-time artifacts. Additionally, modules.json is only reliable when provided per-target anyways[1].

[1] Consider where such a "build a complete modules.json file" rule would exist. It would have to depend on every modules.json from each target, but when a target is rebuilt using ninja just_one_target, what would update this file reliably?

@HighCommander4

HighCommander4 commented Jun 2, 2023

Is there an opportunity to reuse this persisted BMI cache between the indexer and the preambles of open files?

The "lite mode" ASTs (skip-function-bodies etc) may not be suitable for indexing, and vice versa (even opportunistically: full ASTs are much larger => slower to load).

My limited understanding of BMIs is that they're similar to PCHs (which is also what our preambles use), whose deserialization is done more or less on demand (e.g. when iterating all decls() of a NamespaceDecl). This makes me hopeful that a "full" BMI produced by the indexer can be consumed by a preamble build with fairly good performance (and in particular faster than rebuilding that BMI in "lite mode").

I don't see a way we can safely consume modules that clangd itself didn't build

While so far we have been discussing an "MVP" where clangd builds all the modules itself, in the longer term I'm hoping that we will be able to let users opt into a mode of operation where BMIs are shared with the project's build (which would require the build compiler to be clang and clangd to be version-locked to it, and possibly some other restrictions such as the user not using CompileFlags: Add: to add any clangd-specific flags that affect the BMI contents).

@ChuanqiXu9

ChuanqiXu9 commented Jun 16, 2023

Update: I have sent a patch (https://reviews.llvm.org/D153114) for review. See the review page for details. Everyone here is welcome to comment.

@ChuanqiXu9

Update: https://reviews.llvm.org/D153114 is abandoned but the review opinions are pretty helpful. See llvm/llvm-project#66462 for the newest progress.

ChuanqiXu9 added a commit to ChuanqiXu9/llvm-project that referenced this issue Nov 27, 2023
Alternatives to https://reviews.llvm.org/D153114.

Try to address clangd/clangd#1293.

See the links for design ideas. We want to have some initial support in
clang18.

This is the initial support for C++20 Modules in clangd.
As suggested by sammccall in https://reviews.llvm.org/D153114,
we should minimize the scope of the initial patch to make it easier
to review and understand, so that everyone is on the same page:

> Don't attempt any cross-file or cross-version coordination: i.e. don't
> try to reuse BMIs between different files, don't try to reuse BMIs
> between (preamble) reparses of the same file, don't try to persist the
> module graph. Instead, when building a preamble, synchronously scan
> for the module graph, build the required PCMs on the single preamble
> thread with filenames private to that preamble, and then proceed to
> build the preamble.

And this patch reflects the above opinions.
DanShaders pushed a commit to DanShaders/llvm-project that referenced this issue Apr 13, 2024