Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inline Environment Variables in compile_command.json command misparsed? #654

Open
cpsauer opened this issue Jan 19, 2021 · 16 comments
Open
Labels
enhancement New feature or request mac Mostly or entirely affects mac users upstream Bug originates with a library (e.g. clang)

Comments

@cpsauer
Copy link

cpsauer commented Jan 19, 2021

Hi awesome clangd folks,

When providing inline environment variables in a compilation command, I'm seeing behavior that makes me think clangd might be mistaking environment variables for the compiler.

As a toy example, imagine the "command" entry in a compile_command.json file is "A=1 B=2 clang_wrapper file.c", where wrapped_clang crashes if it doesn't get the environment variables it depends on. The expected behavior would be to express what would be invoked if that were executed by a shell: The environment variable A with value 1 would be seen the compiler. Instead, clangd's logs say "compilation failed...index may be incomplete" along with messages saying B is being ignored as a linker. I also saw similar things happen if you try to supply something like "xcrun clang" as a two token compiler. There are definitely things I don't understand about how clangd parses these commands and/or invokes them..but I'm wondering if it's just taking the first token of the command to be the compiler, when that might not be the case.

This definitely shouldn't be that high priority of a thing. Totally understand why clangd would do what I think it does, and I have a workaround. Feel free to ignore, but I thought I should still file an issue as a heads that the edge case cropped up.

Context, if curious: I'm building a tool that extracts compilation commands from Bazel, a build system. On Apple platforms, Bazel wraps the compilers in another binary, and those wrappers rely on environment variables. I can work around this problem by swapping out Bazel's compiler wrapper for clang during the extraction process.

Thanks for reading!

System information:
macOS 11, clangd 11, VSCode

@sam-mccall
Copy link
Member

Ah, indeed I don't think this has ever been supported by clangd or other clang tools.

There's three reasonable things to want here:

  1. skip over the leading A=B and treat wrapped_clang as the driver, to avoid treating it as an argument. This is straightforward and reasonable, nobody ever asked for it yet :-)
  2. pass the env vars to wrapped_clang and use it for parsing. But of course wrapped_clang doesn't offer the API we need, we must use our embedded, instrumented clang parser instead. So this isn't feasible at all.
  3. Probe wrapped_clang to get some details (e.g. built-in include paths) that help our internal parser. This is supported using the --query-driver feature, and with our ad-hoc xcrun probing support. If we parsed the env variables out (1) I suppose we could preserve them when probing

Do you think you'll need 1+3 or is 1 enough?

@sam-mccall sam-mccall added enhancement New feature or request mac Mostly or entirely affects mac users upstream Bug originates with a library (e.g. clang) labels Jan 19, 2021
@cpsauer
Copy link
Author

cpsauer commented Jan 20, 2021

Thanks, Sam, for another thoughtful response.

After reading what you said and the design docs you'd linked, I think this case shouldn't be handled by clangd. Instead it should be on the build system's compile_commands extractor to unwrap things into clang calls, as I'd done in the workaround.

I had guessed (wrongly, it sounds like) that clangd was, at some point, invoking the command given to it and listening to what that driver parsed, via a flag or otherwise. Sounds like that's wrong. I'd assumed the compilation command was being run in some form because the log shows entries like "Failed to compile [file], index may be incomplete" only when the wrapped compiler was being passed in. But perhaps those failures are instead the clangd parser failing because it can't infer the right built-in include paths? The errors did look like NEON intrinsics that might not be supported in a non-Apple set of built-in includes.

Just to confirm: The compiler listed in the command is never invoked, right? But clangd does use the compiler path in the command to find built-in includes for the platform (e.g. Android, Mac, iPhone), even without --query-driver, so long as it looks like a version of clang? I'm a bit confused about when query-driver would need to be provided and why, even after reading the docs and past discussion. But I know there's trickiness I need to get right: I use multiple clangs for different platforms in the same codebase, and I can see logic to swap out the compiler for system clang on macOS that I wouldn't want to trip when cross compiling, e.g. for Android.

To sum: If the driver in the command provided is indeed never executed, then even 1+3 wouldn't do it, since the Bazel/Blaze wrapper does some (simple) flag substitution. Can't expect clangd to parse arbitrary compiler wrappers if it doesn't listen; it should be on the build system extractor (i.e. me 👋🏻) to get you clean compile commands. However, I would think clangd would detect built-in include paths from different clangs.

@sam-mccall
Copy link
Member

I'd assumed the compilation command was being run in some form because the log shows entries like "Failed to compile [file], index may be incomplete" only when the wrapped compiler was being passed in

If the command is "ENV_VAR=FOO wrapped_clang bar.cc" then today, we think argv[0] is "ENV_VAR=FOO", argv[1] is "wrapped_clang", and argv[2] is "bar.cc". argv[0] is not parsed per se, just scoured for clues. But the unexpected wrapped_clang "argument" will cause an error "unused linker input" or similar. That will cause the log message you mention (even if parsing otherwise succeeds).

The compiler listed in the command is never invoked, right? But clangd does use the compiler path in the command to find build-in includes for the platform (e.g. Android, Mac, iPhone), even without --query-driver, so long as it looks like a version of clang?

Correct on both counts. We reuse clang's logic for discovering standard libraries (it searches in some fixed locations, and also relative to its own binary), in an attempt to imitate it. And clang in turn knows how to imitate gcc and MS's CL.exe, so this works for some other compilers too. Mostly. Sometimes. The whole idea is pretty fragile.

I'm a bit confused about when query-driver would need to be provided and why, even after reading the docs and past discussion.

Instead of trying to guess the built-in paths, we can try to invoke the compiler itself and get it to tell us.
This is needed if the heuristics are buggy/ineffective, or if the compiler is nonstandard (e.g. you can ./configure GCC with a custom set of built-in paths, which clangd could never guess).
It works for compilers that accept the GCC-style -E and -v flags.
It's not on by default because there's a security risk in running arbitrary local binaries based on what we read in compile_commands.json, so you have to whitelist the drivers that may be queried in this way.

To sum: If the driver in the command provided is indeed never executed, then even 1+3 wouldn't do it, since the Bazel/Blaze wrapper does some (simple) flag substitution.

I think this is right. compile_commands.json is really about describing how your files are parsed, using clang flags as a convenient format for this description. So some other syntax won't really work, even if it's a valid way to compile the file. That said if they're close enough that e.g. skipping over env variables would get us close, we can do that. And query-driver does blur the lines a little.

@cpsauer
Copy link
Author

cpsauer commented Jan 21, 2021

You're the man, Sam. Totally get it. Thanks for thoughtfully squaring up my understanding.

My last (less-informed) two cents would be that perhaps query-driver should be on by default because the user's going to invoke the compiler anyway, but I'm sure you've thought about this more than I have

Seems it's time to close this as something clangd shouldn't be designed to do?

@njames93
Copy link

You're the man, Sam. Totally get it. Thanks for thoughtfully squaring up my understanding.

My last (less-informed) two cents would be that perhaps query-driver should be on by default because the user's going to envoke the compiler anyway, but I'm sure you've thought about this more than I have

Seems it's time to close this as something clangd shouldn't be designed to do?

Picture this, someone checks in a malicious program and a compile_commands.json into some repository that points to that program as the compiler driver. A unknowing user then clones that repo and opens one of the source files in the editor. Without --query-driver clangd would then invoke that "driver". Once that happens the users system is compromised, even though they never even tried to build the software. Its for this reason that you have to whitelist what drivers clangd is allowed to execute.

@cpsauer
Copy link
Author

cpsauer commented Jan 21, 2021

Hmm, I don't think having it as a flag protects against the case of a malicious project, though, right, since you can just bake the flag into the editor's settings file for that project?

To take VSCode as an example, one could just set the flag in the Project/.vscode/settings.json and then we're back in the case you described.

I'm with you on the problem, for sure. But I'm not so sure the flag fixes things enough to be worth having it broken for some users by default. Opening untrusted code in an IDE is likely to be dicy regardless.

(Also, to be clear, I don't actually need the flag or anything. Just thinking about it out loud.)

@HighCommander4
Copy link

Hmm, I don't think having it as a flag protects against the case of a malicious project, though, right, since you can just bake the flag into the editor's settings file for that project?

To take VSCode as an example, one could just set the flag in the Project/.vscode/settings.json and then we're back in the case you described.

I made this exact point in a recent discussion :)

I wonder if VSCode is unique among clients in allowing language server arguments to be determined by in-workspace metadata files, or if that's fairly typical.

@cpsauer
Copy link
Author

cpsauer commented Feb 3, 2021

I'm going to close my issue, because the main issue assumes different inner workings than those clangd actually has, and the remaining discussion has moved toward a duplicate of #539. Thanks all for being great to a new user!

@cpsauer
Copy link
Author

cpsauer commented Nov 18, 2022

@sam-mccall, I'm returning to this now as a (slightly) less naive clangd user, because I think I was wrong--and it might benefit clangd to be able to take in environment variables after all.

Since I posted this, I've been banging around on a bunch of platforms to get things working for Bazelers wanting to use clangd. (We released the tool I was building for myself, mentioned in the original post, and it's been a decent hit. Yay helping people!) [As a Googler, you may well know Bazel instead as Blaze. The ex-Googler in me does at least.]

While unwrapping or query-drivering works pretty often, I'm also seeing common platforms where environment variables are an essential part of understanding commands. As examples, msvc (and presumably clang-cl?) requires an INCLUDE env var, which is what lists the system header locations--a goldmine for clangd, I'd think. And the emscripten compiler (emcc) is pretty clang-like, except its configuration is heavily dependent on environment variables. Querying it (or running it in any form) will requires passing through quite a few environment variables. I don't personally need either of these urgently, but they've come up with users.

So I'm wondering if it might make sense to let people pass important environment variables to clangd after all. Format-wise, since an arguments list (rather than command form) is now the preferred form in compile_commands.json (especially on Windows...) maybe there should be a separate environment part of each command in compile_commands.json and in clangd config. And since environments will mostly be duplicated in compile_commands.json, maybe we should have a reference/normalization option, where they can also be IDs pointing into to a separate list of environments? Honestly, that db normalizing option would also be super valuable for commands, since, with headers, the files are often mostly made up of duplicated commands, but I digress.

Thanks for reading--and being awesome!
Chris

@cpsauer cpsauer reopened this Nov 18, 2022
@HighCommander4
Copy link

HighCommander4 commented Nov 20, 2022

While it doesn't have the full expressive power of what you are describing, it's worth noting that editors typically do propagate their environment to clangd, and clangd in turn propagates it into query-driver invocations.

So, at least for the case where, for a given project, it's a common set of environment variables that the compiler driver needs to see, you can accomplish this by having these set in the environment where the editor is executed (which will in turn propagate it to clangd etc.)

As a slight refinement of this, editors could support a way to add things into the environment for their clangd invocation, rather than requiring it to be present for the editor invocation itself (clangd/vscode-clangd#198 tracks this for vscode).

Of course, that still leaves cases where different files in a project require different env vars. The kind of clunky current workaround that comes to mind for that is to write a few compiler wrapper scripts which set the relevant env vars and then invoke the compiler, and have compile commands refer to the relevant wrapper.

If we want to improve on that, perhaps an approach with relatively low implementation complexity would be to allow specifying env vars in an Environment: key under CompileFlags: in the clangd config file? (This can then be scoped to different subsets of files with If:.)

Modifying the compile_commands.json format to have an extra field for env vars, as you describe, would be another possibility, but this would impact the larger clang-based tooling ecosystem and so it probably requires seeking consensus for it in a different forum (LLVM Discourse perhaps?)

@cpsauer
Copy link
Author

cpsauer commented Nov 23, 2022

Great ideas as always, @HighCommander4 :)

Environment wrapper scripts seems like the best workaround to me (rather than special invocations of the editor). I know the Bazel case will definitely lead to them differing by file, but with lots of repetition.

But none of these will get the rich info in the critical environment variables (e.g. on Windows) to clangd--hence the compile_commands.json proposal. I think the environment is more per command than project-level. At least that'd be the natural implementation for the use cases I'm seeing, which involve cross compilation for a subset of commands, intermixed across the codebase.

Re format and consensus: Figured we all should talk first. And the consensus part is why I figured I'd better say something as soon as it became clear that this extension might be important. You all certainly know how to build LLVM consensus better than I.

Cheers,
Chris

@sam-mccall
Copy link
Member

If we want to improve on that, perhaps an approach with relatively low implementation complexity would be to allow specifying env vars in an Environment: key under CompileFlags: in the clangd config file?

Typically environment variables might affect what the driver does (like C_INCLUDE_PATH). We can't set these on a per-file basis for the built-in driver (because env is per-process and we process multiple files concurrently). So this would only work for query-driver. I don't think we should add this to config because:

  • it is confusing that they only half-work
  • it will push people towards using query-driver (which is a hack) rather than working out how to get a compile_commands.json that describes a standard clang invocation
  • people already using query-driver already have a way to do this: wrap their driver with a script that sets the variables, and point Compiler at it. (This is more steps, but doesn't add more hacky logic and support surface area, and I think makes the situation more explicit & less confusing)

@HighCommander4
Copy link

We can't set these on a per-file basis for the built-in driver (because env is per-process and we process multiple files concurrently).

Ah, good point!

To be clear, this is an argument not only against adding an Environment: key to the config file, but other potential formulations of per-file environment configuration, such as adding an "environment" key to compile_commands.json entries, as well.

@cpsauer
Copy link
Author

cpsauer commented Dec 14, 2022

Thanks for great thoughtfulness guys, as usual.

I hear you loud and clear on the problems implementing environment variables per-file. The wrapper script does seem an okay workaround for query-driver, as of course does massaging things into standard clang flags, though that code would then have to be reimplemented per build system rather than shared in clangd.

I will say, query-driver has been invaluable in simplifying usage; we've been finding that lots of toolchains (annoyingly) choose to have wrapper scripts that basically just pass through flags but need penetrating for system includes, and query-driver does, generally, seem to be the more robust option. It makes me a little sad to think of asking directly (query-driver) as the hack; the other seems like the heuristic.

Anyway, happy to close this down if you guys would like--just thought I should correct my earlier statements once I started seeing clear evidence that popular platforms (Windows, WASM) expressed key command information as environment variables in addition to flags.

@HighCommander4
Copy link

HighCommander4 commented Dec 14, 2022

query-driver does, generally, seem to be the more robust option. It makes me a little sad to think of asking directly (query-driver) as the hack; the other seems like the heuristic.

FWIW, this is my perspective too: that query-driver is the more accurate approach for discovering built-in includes, and it's what I recommend as the default for power users. The only reason I don't recommend it as the default for casual users is that the required coordination between the value of the --query-driver command-line argument and the argv[0] in compile_commands.json seems like a hassle in cases where you don't need it.

This is also the approach taken in Eclipse CDT, where its equivalent of query-driver (called the "Built-in Compiler Settings Provider") is enabled by default and you have to uncheck it in your settings if you don't want it.


As a minor aside, there's some interesting work going on in the ISO C++ Standards Committee's Tooling Study Group on standard metadata file format specifications like this one. It makes me hopeful that at some point in the future, we might get a standard "toolchain description" format, and it would become common practice for compilers of all sorts to ship with or produce files in this format that contain information such as built-in include paths, which can then be consumed by tools like clangd, rather than relying on gcc-like driver syntax (followed by parsing free-form textual output) as a de facto standard way to get access to this information.

@HighCommander4
Copy link

at some point in the future, we might get a standard "toolchain description" format, and it would become common practice for compilers of all sorts to ship with or produce files in this format that contain information such as built-in include paths, which can then be consumed by tools like clangd, rather than relying on gcc-like driver syntax (followed by parsing free-form textual output) as a de facto standard way to get access to this information.

Just to underscore this point, within a day of me writing this we've had a bug report filed about our code for parsing the textual output of gcc -E -v -x c++ /dev/null being unreliable in the case of localized versions of gcc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request mac Mostly or entirely affects mac users upstream Bug originates with a library (e.g. clang)
Projects
None yet
Development

No branches or pull requests

4 participants