
Extended Dependency Generation #245

Open — DavidEichmann wants to merge 3 commits into master from DavidEichmann:ExtDepGen
Conversation

@DavidEichmann commented Jul 2, 2019

This is a proposal to add a new GHC feature that would output detailed build dependency information in a machine-readable format.

Rendered

@DavidEichmann (Author) commented Jul 2, 2019

I have questions about this proposal as is:

  • There is no mention of overlap with -M
    • "the module's direct imports" is largely covered by "-M"
      • well... there is maybe some extra output in this proposal:
        • "the path where the source file was found"
        • "package ID where the module was found"
        • "In both cases a list of file paths where GHC looked for the import before finding it"
  • It's not clear why "language pragmas" and "module-level deprecations" should be included. The source file will change, so we already know recompilation is required. Yes, "cabal-install might want to do metadata consistency checks", but is that really a good justification? It sounds oddly specific. Is this actually intended just for a cabal-install use case? If so, we should explore the use case in more detail.
  • The description of the algorithm for finding everything that must be rebuilt sounds wrong (it traverses down to the dependencies of the changed files, when instead it should traverse up to their dependants; see the sketch at the end of this comment).
  • If this is really about recompilation, then we can perhaps make a clear statement about the scope of information that this new feature should output: output exactly the information needed to infer if recompilation is not necessary ("not necessary" is more accurate; inferring if recompilation truly is necessary is impractical)
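
For concreteness, here is a minimal sketch of the traversal meant in the third point above, assuming a plain reverse-dependency map (none of this is taken from the proposal itself):

```haskell
-- A sketch only: compute everything that may need recompilation by walking
-- *up* the reverse dependency graph from the changed modules.
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type Module = String

-- Hypothetical reverse dependency graph: module -> modules that import it.
type Dependants = Map.Map Module (Set.Set Module)

needsRecompile :: Dependants -> Set.Set Module -> Set.Set Module
needsRecompile revDeps changed = go changed (Set.toList changed)
  where
    go seen []       = seen
    go seen (m : ms) =
      let direct = Map.findWithDefault Set.empty m revDeps
          new    = direct `Set.difference` seen
      in  go (seen `Set.union` new) (Set.toList new ++ ms)
```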
@Ericson2314 commented Jul 2, 2019

I am totally for the goal, but suspicious of the details. Namely, this pre-/post- stuff does match what GHC does today, but it is a fundamentally impure formalism: GHC needs to read files other than its pre-depends, so there's no way to sandbox build steps up front. There's no way I could directly plug this info into Nix.

The corresponding pure formalism is dynamic/monadic dependencies: some build steps, instead of creating regular file outputs, create more build plan. For example, for the TH case GHC really ought to serialize its continuation as a dynamic build step with an extra dependency on the to-be-read file. Now, it may just be easier to restart the job with the extra dep than to "really" serialize the continuation, but that can be viewed as an implementation detail.

I would really like us to start with the pure formalism now, shoehorning the compiler in as needed in the short term but improving the fit as time goes on.

CC @taktoa.

DavidEichmann changed the title from "WIP Copy from GitLab wiki" to "WIP Extended Dependency Generation" on Jul 2, 2019
DavidEichmann force-pushed the DavidEichmann:ExtDepGen branch from d4aa088 to fd240ea on Jul 2, 2019
@Ericson2314 commented Jul 3, 2019

OK, here's a concrete proposal. GHC takes a list of paths which it assumes are up to date. If it only reads files within the set, the operation succeeds. If it needs to read files outside of that set, it fails, but provides the missing path in the dumped info. This is a small change, but it effectively mediates between GHC and the external build system.

  • In the case of an unsandboxed build system like Make or Shake or Bazel, the file might already exist, so this acts as a crude form of sandboxing to make sure there are no missing dependency edges.

  • In a sandboxed build system like Nix, the sandboxing part of this is not needed, but it still provides a "hook" for Nix to build the needed file, if it isn't static in the source. Actually, I no longer think Nix needs this.

This is still distasteful to me; it is not one of the nice, proper architectures for incremental computation. But ghc --make is really bad and I can't let the perfect get in the way of the better.
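
To illustrate the shape of that mediation, here is a minimal sketch under invented names (none of this is GHC API): the build system hands over the set of files it claims are up to date, and GHC's reads are checked against it.

```haskell
-- Hypothetical sketch of the "assumed up to date" set mediating file reads.
import qualified Data.Set as Set

type AllowedFiles = Set.Set FilePath

data DepCheck
  = Succeeded                -- every read stayed inside the allowed set
  | MissingDeps [FilePath]   -- reads outside the set, reported back to the build system
  deriving Show

-- Compare the files GHC attempted to read against the set the build system
-- claimed was up to date.
checkReads :: AllowedFiles -> [FilePath] -> DepCheck
checkReads allowed attempted =
  case filter (`Set.notMember` allowed) attempted of
    []      -> Succeeded
    missing -> MissingDeps missing
```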

@Ericson2314 commented Jul 3, 2019

On a completely different note, #243 along with https://gitlab.haskell.org/ghc/ghc/issues/10871 is very good for these sorts of things. Note that once all splices are run, GHC still has most of its work cut out for it, yet all the dynamism is gone. If we separate the TH cleanly, and serialize the post-splicing parsed source in one of those fat interface files, we can invoke GHC twice, where the second time it does most of the slower work. Note this all means we can be fine-grained in two dimensions: invoking GHC once per module, and twice per pipeline (two groupings of the pipeline stages).

In particular, each TH stage can get its own cheap downsweep, and spliced module imports no longer ruin everything, since stage-separated imports limit the dynamism induced by them.

@DavidEichmann (Author) commented Jul 4, 2019

@Ericson2314, we discussed this a bit over IRC; would this be an accurate change to the proposal:

GHC will collect dependencies throughout compilation and report them all (even if compilation fails). If compilation fails due to a missing dependency (or dependencies), then the missing dependencies should be reported as such. A "reasonable" effort should be made to report as many missing dependencies as possible before failing, where "reasonable" means avoiding doing too much extra work.

The motivation for this is that a build system would then be able to generate the missing dependency and continue (in practice restart) the build. Without this behaviour, the build system may have no easy way to discover and generate such dependencies.
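
A minimal sketch of the build-system side of that interaction, assuming GHC reports missing dependencies in some machine-readable form (all names here are invented for illustration):

```haskell
data BuildOutcome
  = BuildOk
  | BuildFailedMissing [FilePath]   -- missing dependencies reported by GHC

-- Hypothetical driver loop: run GHC, and if it reports missing dependencies,
-- generate them and restart the build with the enlarged set of known files.
buildLoop :: ([FilePath] -> IO BuildOutcome)   -- run GHC against the known files
          -> (FilePath -> IO ())               -- generate one missing dependency
          -> [FilePath]                        -- files known so far
          -> IO ()
buildLoop runGhc generateDep = go
  where
    go known = do
      outcome <- runGhc known
      case outcome of
        BuildOk                    -> pure ()
        BuildFailedMissing missing -> do
          mapM_ generateDep missing
          go (known ++ missing)    -- in practice: restart the build
```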

@DavidEichmann (Author) commented Jul 4, 2019

OK here's a concrete proposal. GHC takes a list of paths which it assumes are up to date. If it only reads files within the set, the operation succeeds....

Introducing some crude sandboxing is an interesting idea. It seems like an easy next step implementation-wise. IIUC this would be a way to check that the build system has a complete set of dependencies, which would help to identify bugs in a build system.

It feels to me like this is slightly out of scope for this proposal. I imagine it would be implemented as a separate feature. Do you think it should be included in this proposal?

@Ericson2314 commented Jul 4, 2019

@DavidEichmann

We discussed a bit over IRC, would this be an accurate change to the proposal

I agree with that.

It feels to me like this is slightly out of scope of this proposal. I Imagine it would be implemented as a separate feature. Do you think it should be included in this proposal?

Actually I no longer think that is needed.

First, some background for everyone not on IRC at that time. When we say "fails due to a missing dependency", I think it's important to consider that a missing file and an out-of-date file are semantically both missing. With a traditional unsandboxed build system, GHC has no way of knowing whether a file it reads is stale or not. Traditionally with -M, build systems use the previous run to calculate the dependencies: in the worst case, everything is done from a stale downsweep every time!

I really want to avoid this buggy statefulness. But that doesn't mean it's our job to make unsound build systems sound; that's their problem. I realized how the proposed stuff can work with Nix (or Nix + NixOS/rfcs#40) without changes, so that satisfies me. For anyone curious, here's how:

  1. Downsweep (what produces the precompilation dependencies) has to be run every time "anything" changes. What is anything? Any Haskell file that is part of the current component, and any #included file that contributes imports. (TH cannot add imports.) This means that Haskell source generated by the external build system has to be rebuilt often too. There's just no way around this. There are 3 optimizations:

    • Incremental downsweep which is run just on the files that changed. (Note files that changed, not subgraph that changed; changed modules can depend on already downswept modules and vice-versa).
    • Preprocessing each module so that information irrelevant to downsweep is removed. This way, hash-based caching has less spurious eviction.
    • If there is an a-priori determined main module or exposed-module list, we can use that as a root-set via the trampoline method mentioned below, rather than requiring that all source and CPP includes be built up front.
  2. Dynamic dependencies are handled with a "trampoline" build step. The trampoliner needs to see all files, or rather the rules to build them, so that every time the TH-runner etc. fails, it can restart it with more in its sandbox. Yes, all the restarts make for a Shlemiel-the-painter quadratic runtime, which sucks, but at least it is sound, and the final successful run has a minimal set of dependencies.

    • Optimization aside: the magic FUSE-augmented build systems which build files on demand when processes attempt to open them avoid some restarts, but not all: in the "online" build-system case one doesn't assume that "raw source" leaf inputs change more slowly than the build system traverses the graph. That means building is no longer disjoint rounds of purely monotonic progress. If a deep dynamic dependency changes a lot, we still need to either re-run all of the program up to the opening of that lazily-produced file (not incremental), or core-dump the program (crudely serialize the continuation) at each opening of such a lazily-produced file for O(1) resumption.

So, again avoiding worrying too much about perfection, I'm satisfied that whatever GHC throws at us today can be shoehorned into soundness with a sufficiently crafty build system. I then hope such a build system can be used to speed up development of GHC itself, which will make it practical for the first time to undertake the major refactors that allow for more idealistic schemes ("dank incremental GHC"). The better, in fact, begets the perfect.

DavidEichmann force-pushed the DavidEichmann:ExtDepGen branch 2 times, most recently from 5fdc8f4 to 2172f25 on Jul 26, 2019
DavidEichmann changed the title from "WIP Extended Dependency Generation" to "Extended Dependency Generation" on Jul 29, 2019
DavidEichmann force-pushed the DavidEichmann:ExtDepGen branch 2 times, most recently from 32f594a to f5b0858 on Jul 30, 2019
DavidEichmann force-pushed the DavidEichmann:ExtDepGen branch from f5b0858 to 851d2e6 on Jul 30, 2019
@ulysses4ever (Contributor) commented Jul 31, 2019

In the scope of my summer internship at Tweag, this is very relevant. I'm working on improving support for Haskell projects in the Bazel build system. Specifically, I'm building a persistent worker wrapping GHC. I've been able to make a working prototype using just the GHC API, but this doesn't scale to real challenges such as incremental compilation.

We haven't worked out all the details so far, but here is a rough sketch of why I need this proposal to happen. Bazel sends compilation requests to a resident process (the worker) with subsets of the project contents as inputs, and expects the worker to build a target (either a Haskell library or a binary) from those inputs. In order to cache the results of serving such requests, we need to get a handle on the dependencies inside those inputs and recompile in one-shot mode only what is needed.

@Ericson2314 commented Jul 31, 2019

@ulysses4ever you should talk to @DanielG, whose GSoC work on GHC is about doing basically the same thing for the Haskell IDE engine instead of Bazel.

@ulysses4ever (Contributor) commented Jul 31, 2019

@Ericson2314 thanks for the heads up! We discussed this a bit with @mpickering, who also referenced that GSoC project. My current understanding is that there is a subtle but essential (in my view) difference between these two tasks. Namely, HIE serves compilation requests at per-file granularity, while Bazel sends compilation requests at per-subset granularity, expecting the server to produce a linked result (a library or binary).

@Ericson2314 commented Jul 31, 2019

@ulysses4ever By "subset" you mean a subset of all files, namely those invalidated since the last invocation?

That's a difference, but I don't feel like it's an essential one. Perhaps you need this proposal more than he does, but that's fine. I suppose my point was a more off-topic one: everyone benefits from a better division of labor in GHC between the "pure functional compilation" and the build-system-esque crawling of files and caching. Haskell-ide-engine and Bazel want to be completely in charge of the latter.

Given the prominence of LSP, I can imagine a future version of Bazel that tracks individual spans, not just files, and likewise expects the persistent worker to take individual span updates. Then yours and @DanielG's use-cases converge.

@ulysses4ever (Contributor) commented Jul 31, 2019

By subset I mean a set of files in the project that are known to constitute a target (bin or lib), plus their hashes. From this data, the worker should decide by itself what is and isn't up to date since the last invocation. It also has to preserve intermediate artifacts like .o and .hi files.

You may be quite right about convergence, I agree.

@DavidEichmann (Author) commented Aug 1, 2019

@phadej, thanks for the questions via IRC. I'll try to respond here and in the next few comments.

How can we discover any dependencies before running CPP?

Good question! You are right: CPP must be run before the precompilation dependencies are discovered. I think the reason CPP deps are listed after the imports is that -M currently does a dependency analysis (which does a CPP pass) to get the imports, but then we do a parse to extract the CPP includes from the preprocessed file (which will contain some pragmas for each CPP include). I think it may be possible to extract the CPP includes without doing the extra parse, but that's a separate issue. In either case we could consider reporting CPP includes as part of the precompilation deps, but we should consider the performance cost before deciding on this.

EDIT

I've decided to simply move the CPP deps to the precompilation dependencies. When implementing the -include-cpp-deps option, I found the cost to be small relative to a full build (in the context of a Hadrian build of GHC).

@DavidEichmann (Author) commented Aug 1, 2019

Why do we report plugins twice: as OPTIONS_GHC pragmas and also as part of the dynamic deps?

Yes, this is a good point. This was in the original proposal, and I left it there in case there are other options that could introduce dependencies, but I think this is a poor justification. We don't want consumers of this information to have to parse GHC command-line arguments. If there is something in the options that introduces dependencies, then we should report it explicitly, as we do with plugins.

  • Remove OPTIONS_GHC pragmas from precompilation deps.
@DavidEichmann (Author) commented Aug 1, 2019

Plugins can declare if recompilation is necessary based on the plugin options (see plugins docs and proposal).

Thanks for pointing this out. It looks like we could incorporate this info by reporting the PluginRecompile returned by the plugin, though I think the MaybeRecompile Fingerprint case may be a bit inconvenient for external build tools to take advantage of: they will need some way to query for a new fingerprint (which shouldn't be too hard to do). But this doesn't work well in cases where the plugin introduces file dependencies based on the source of the module being compiled. The proposal's Drawbacks section recommends that the plugin return ForceRecompile in that case, but ideally we'd like to avoid extra recompilation. On the other hand, IIUC, plugins should be able to call addDependentFile, so perhaps this is sufficient. @mpickering, any thoughts here?

EDIT

I've confirmed that plugins can in fact use addDependentFile.
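
For reference, this is roughly what the plugin-side hook looks like (GHC 8.6-era API; module and helper names may differ in later releases, so treat this as a sketch rather than as part of the proposal):

```haskell
-- A plugin declares its recompilation behaviour via `pluginRecompile`.
module MyPlugin (plugin) where

import GhcPlugins  -- exposed by the `ghc` package in GHC 8.6/8.8

plugin :: Plugin
plugin = defaultPlugin
  { -- `purePlugin` never forces recompilation because of this plugin.
    -- Alternatives: `impurePlugin` (always recompile) and `flagRecompile`
    -- (fingerprint the plugin's command-line options).
    pluginRecompile = purePlugin
  }
```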

@DavidEichmann (Author) commented Aug 1, 2019

@ulysses4ever, great to hear there is another potential use case! Is there anything in particular that you think is missing from the output we intend to generate?

@DavidEichmann (Author) commented Aug 12, 2019

As there seems to be little interest in adding a {-# DEPENDS fileA fileB ... #-} pragma, I'll leave this as future work.

Remove `OPTIONS_GHC` pragma output.
Note `addDependentFile` may come from plugins as well as TH.
Remove all items in the unresolved questions section.
A DEPENDS pragma is left as future work as there doesn't seem to be any interest in this.
@Ericson2314 commented Aug 14, 2019

The just-opened https://github.com/michaelpj/ghc-proposals/blob/white-box-interface-files/proposals/0000-white-box-interface-files.rst will help, in that if you do only part of the compilation, you should be able to "save your progress", not just output dependencies.

I think it would want an untyped HIE file for parsing, maybe also for renaming, and certainly hi-wb files for type checking and desugaring. This would permit extremely fine-grained incremental compilation. hi-wb files can also be used to keep the hi files "free" of specialized and inlinable definitions, so type checking isn't unnecessarily invalidated.

This combines well with #243 for removing TH, as I mentioned above, too.

CC @michaelpj.

@ndmitchell (Contributor) commented Sep 8, 2019

This information would be super helpful to Hadrian, I suspect (cc @snowleopard). At the moment, compiling N modules correctly requires running ghc -M over all N (which takes O(N)), followed by ghc -c over all N (which takes O(N)). If just 1 module changes, you have to rerun ghc -M and then 1 ghc -c, but that's still O(N). For big projects, that N can start to dominate. This looks like a very viable modern alternative to ghc -M, which is definitely not fit for purpose anymore.

My only query is: why allow globs in the files queried? I don't see why that is ever useful, and it does seem to complicate the design, not least by requiring a semantics for globs.

@Ericson2314 commented Sep 8, 2019

Related to the above, I'm thrilled that http://www.well-typed.com/blog/2019/08/exploring-cloud-builds-in-hadrian/ has brought up these filesystem access errors. I believe we can and must get that to 0, and that is basically the criterion for a good design for this feature.

@snowleopard commented Sep 9, 2019

@ndmitchell Indeed, this should help Hadrian too! Let me also link this relevant discussion:

https://gitlab.haskell.org/ghc/ghc/issues/16253#note_168035

@DavidEichmann (Author) commented Sep 10, 2019

@ndmitchell, the motivation to include globs was to capture cases where GHC might itself use glob patterns, but now that you mention it, I did a little searching and I suspect GHC never uses glob patterns to find dependencies. I'll remove the glob patterns from the proposal, as they don't seem necessary.

These were intended to capture the cases where GHC uses glob patterns, but I suspect GHC does not do that.