CTFECache - caching D compiler #7239

Closed. Wants to merge 1 commit into base: master.

Conversation

@DmitryOlshansky
Member

DmitryOlshansky commented Oct 24, 2017

A preliminary patch to cache CTFE executions in a persistent store.

This allows expensive CTFE computations to be reused across builds.
Ironically, processing the resulting huge struct literal is still quite costly, but that is a fight for another day.

In action:

▶ cat r*.d
import std.regex;

enum r = regex(`\w`);

void main(){}
import std.regex;

enum r = regex(`\w\w`);

void main(){}
import std.regex;

enum r = regex(`\w\w\w`);

void main(){}
import std.regex;

enum r = regex(`\w\w\w\w`);

void main(){}

▶ cat bench.sh
#!/bin/bash
for f in r*.d ; do
        echo "Compiling $f with dmd $@"
        time dmd "$@" "$f"
done

Recent master version

▶ dmd --version
DMD64 D Compiler v2.077.0-144-gbbca650
Copyright (c) 1999-2017 by Digital Mars written by Walter Bright

▶ ./bench.sh
Compiling r1.d with dmd

real 0m3.047s
user 0m2.808s
sys 0m0.232s
Compiling r2.d with dmd

real 0m4.105s
user 0m3.832s
sys 0m0.264s
Compiling r3.d with dmd

real 0m4.682s
user 0m4.372s
sys 0m0.304s
Compiling r4.d with dmd

real 0m5.457s
user 0m5.128s
sys 0m0.320s
./bench.sh 16.14s user 1.12s system 99% cpu 17.294 total

DCache version

▶ dmd --version
DMD64 D Compiler v2.077.0-122-g7a50140
Copyright (c) 1999-2017 by Digital Mars written by Walter Bright

▶ ./bench.sh
Compiling r1.d with dmd

real 0m3.358s
user 0m3.004s
sys 0m0.348s
Compiling r2.d with dmd

real 0m3.909s
user 0m3.500s
sys 0m0.404s
Compiling r3.d with dmd

real 0m4.579s
user 0m4.152s
sys 0m0.420s
Compiling r4.d with dmd

real 0m5.332s
user 0m4.836s
sys 0m0.488s
./bench.sh 15.49s user 1.66s system 99% cpu 17.182 total

Now with caching + cache usage traces

▶ ./bench.sh -cache=mmap
Compiling r1.d with dmd -cache=mmap
Using DCache at /home/olshanskiy/.cache/dcache-v2.077.0-122-g7a50140
Recomputed InversionList!(GcPolicy) memoizeExpr() @trusted in 0.690174
Recomputed InversionList!(GcPolicy) wordCharacter() @Property @safe in 0.690408
Recomputed Trie!(BitPacked!(bool, 1LU), dchar, 1114112LU, sliceBits!(8LU, 21LU), sliceBits!(0LU, 8LU)) codepointSetTrie(InversionList!(GcPolicy) set) pure @safe in 0.381241
Recomputed CharMatcher this(InversionList!(GcPolicy) set) ref in 0.418019
Recomputed Regex!char regex(string[] patterns, const(char)[] flags = "") @trusted in 1.111543
Recomputed Regex!char regex(string pattern, const(char)[] flags = "") @trusted in 1.112261

real 0m3.393s
user 0m3.040s
sys 0m0.348s
Compiling r2.d with dmd -cache=mmap
Using DCache at /home/olshanskiy/.cache/dcache-v2.077.0-122-g7a50140
Mixed in InversionList!(GcPolicy) wordCharacter() @Property @safe from cache in 0.000605
Mixed in CharMatcher this(InversionList!(GcPolicy) set) ref from cache in 0.001930
Recomputed Regex!char regex(string[] patterns, const(char)[] flags = "") @trusted in 1.750151
Recomputed Regex!char regex(string pattern, const(char)[] flags = "") @trusted in 1.750862

real 0m2.882s
user 0m2.644s
sys 0m0.232s
Compiling r3.d with dmd -cache=mmap
Using DCache at /home/olshanskiy/.cache/dcache-v2.077.0-122-g7a50140
Mixed in InversionList!(GcPolicy) wordCharacter() @Property @safe from cache in 0.000705
Mixed in CharMatcher this(InversionList!(GcPolicy) set) ref from cache in 0.001632
Recomputed Regex!char regex(string[] patterns, const(char)[] flags = "") @trusted in 2.486691
Recomputed Regex!char regex(string pattern, const(char)[] flags = "") @trusted in 2.487412

real 0m3.611s
user 0m3.212s
sys 0m0.392s
Compiling r4.d with dmd -cache=mmap
Using DCache at /home/olshanskiy/.cache/dcache-v2.077.0-122-g7a50140
Mixed in InversionList!(GcPolicy) wordCharacter() @Property @safe from cache in 0.000636
Mixed in CharMatcher this(InversionList!(GcPolicy) set) ref from cache in 0.001907
Recomputed Regex!char regex(string[] patterns, const(char)[] flags = "") @trusted in 3.138693
Recomputed Regex!char regex(string pattern, const(char)[] flags = "") @trusted in 3.139398

real 0m4.285s
user 0m3.816s
sys 0m0.464s
./bench.sh -cache=mmap 12.72s user 1.44s system 99% cpu 14.176 total

2nd run DCache

▶ ./bench.sh -cache=mmap
Compiling r1.d with dmd -cache=mmap
Using DCache at /home/olshanskiy/.cache/dcache-v2.077.0-122-g7a50140
Mixed in InversionList!(GcPolicy) wordCharacter() @Property @safe from cache in 0.000619
Mixed in CharMatcher this(InversionList!(GcPolicy) set) ref from cache in 0.001780
Mixed in Regex!char regex(string pattern, const(char)[] flags = "") @trusted from cache in 0.002774

real 0m1.202s
user 0m1.064s
sys 0m0.132s
Compiling r2.d with dmd -cache=mmap
Using DCache at /home/olshanskiy/.cache/dcache-v2.077.0-122-g7a50140
Mixed in InversionList!(GcPolicy) wordCharacter() @Property @safe from cache in 0.000714
Mixed in CharMatcher this(InversionList!(GcPolicy) set) ref from cache in 0.001862
Mixed in Regex!char regex(string pattern, const(char)[] flags = "") @trusted from cache in 0.002727

real 0m1.247s
user 0m1.112s
sys 0m0.132s
Compiling r3.d with dmd -cache=mmap
Using DCache at /home/olshanskiy/.cache/dcache-v2.077.0-122-g7a50140
Mixed in InversionList!(GcPolicy) wordCharacter() @Property @safe from cache in 0.000631
Mixed in CharMatcher this(InversionList!(GcPolicy) set) ref from cache in 0.001766
Mixed in Regex!char regex(string pattern, const(char)[] flags = "") @trusted from cache in 0.002734

real 0m1.181s
user 0m1.016s
sys 0m0.156s
Compiling r4.d with dmd -cache=mmap
Using DCache at /home/olshanskiy/.cache/dcache-v2.077.0-122-g7a50140
Mixed in InversionList!(GcPolicy) wordCharacter() @Property @safe from cache in 0.000617
Mixed in CharMatcher this(InversionList!(GcPolicy) set) ref from cache in 0.001752
Mixed in Regex!char regex(string pattern, const(char)[] flags = "") @trusted from cache in 0.002806

real 0m1.169s
user 0m1.076s
sys 0m0.088s
./bench.sh -cache=mmap 4.27s user 0.51s system 99% cpu 4.804 total

@dlang-bot

Contributor

dlang-bot commented Oct 24, 2017

Thanks for your pull request, @DmitryOlshansky!

Bugzilla references

Your PR doesn't reference any Bugzilla issue.

If your PR contains non-trivial changes, please reference a Bugzilla issue or create a manual changelog.

@@ -6521,6 +6520,7 @@ extern (C++) class TemplateInstance : ScopeDsymbol
}
else
{
*(cast(ubyte*)0) = 0x11;

@thewilsonator

thewilsonator Oct 24, 2017

Contributor

Did you mean to leave this in?

@DmitryOlshansky

DmitryOlshansky Oct 24, 2017

Member

Yeah, I was in a hurry; will do a few passes of cleanup.

buf.writestring(e.sd.type.toChars());
if (strncmp(e.sd.type.toChars(), "Kickstart", 9) == 0)
{
printf("Got it %s!\n", e.sd.type.toChars());

@thewilsonator

thewilsonator Oct 24, 2017

Contributor

Ditto

@stefan-koch-sociomantic

stefan-koch-sociomantic commented Oct 24, 2017

This is a horrendous hack, and does not actually fix anything.
If the compiler has an OOM computing the regex you are right back where you started.

Also, you did not actually add dcache.d.

@DmitryOlshansky

Member

DmitryOlshansky commented Oct 24, 2017

This is a horrendous hack, and does not actually fix anything.

It does. And I'm tired of waiting for ideal CTFE.

If the compiler has an OOM computing the regex you are right back where you started.

Well, a single regex is usually not enough to OOM. So after a couple of attempts it will have cached all successful evaluations and be done. Way more practical than no solution at all.

Also, you did not actually add dcache.d.

Yep, fixed.

@nordlow

Contributor

nordlow commented Oct 25, 2017

Why has newCTFE stalled, @stefankoch?

@andralex

Member

andralex commented Oct 27, 2017

Whoa this is interesting. Please rebase. cc @WalterBright

@WalterBright

Member

WalterBright commented Oct 27, 2017

A preliminary patch to cache CTFE executions in a persistent store. This allows to reuse expensive CTFE computations across builds. Ironically processing huge struct literal is still quite costly but is a fight for another day.

It'd be nice to put such documentation into the code, as it is currently bereft of a clue :-)

@nordlow

Contributor

nordlow commented Oct 27, 2017

I presume parser generator frameworks like Pegged can make direct use of this, right, @DmitryOlshansky ?

@DmitryOlshansky

Member

DmitryOlshansky commented Oct 27, 2017

I presume parser generator frameworks like Pegged can make direct use of this, right, @DmitryOlshansky ?

Indeed, after the following:

static parser = Grammar!"....";

Not only is the exact CTFE call with the full grammar string cached, but so are any intermediate CTFE executions that take the form of:

enum x = ...

or

static y = ...

And other such top-level stuff. This means that changing the grammar doesn't lead to full recomputation of all intermediate CTFE executions. However, I think Pegged leans more heavily on templates than on CTFE.

Whoa this is interesting. Please rebase. cc @WalterBright

I'm working on a cache invalidation strategy and could use some help here.

My current baseline algorithm is as follows:

  1. Take an expression and extract all non-trivial types.
  2. Resolve them as symbols (this means I know where they came from).
  3. Take the module of each symbol + all of the local imports in its declaration as a bootstrap list.
  4. Starting with the bootstrap list, compute the closure of imports.

Finally I shall have all modules directly or indirectly referenced by an expression.

Second step: for each file, at load time, I compute a signature (MD5 is fast and should work for starters).
Now I can capture a module's dependencies in a single unique string: the concatenation of all signatures.
To shorten it I'll do another MD5 over this string, finally obtaining a short unique "dependencies key".

Now I prepend it to all my keys and get proper invalidation: if I touch any of the modules, I won't find the old (stale) cached value.
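The invalidation scheme above can be sketched as follows (illustrative Python, not the dmd implementation; all names are hypothetical):

```python
import hashlib

def module_signature(source: bytes) -> str:
    # Per-module signature: MD5 of the module's source text.
    return hashlib.md5(source).hexdigest()

def dependencies_key(module_sources) -> str:
    # Concatenate every module signature, then MD5 the result to get
    # a short unique "dependencies key" for the whole import closure.
    combined = "".join(module_signature(s) for s in module_sources)
    return hashlib.md5(combined.encode()).hexdigest()

def cache_key(deps_key: str, expr_repr: str) -> str:
    # Prepending the dependencies key gives invalidation for free:
    # touching any module changes the key, so stale entries are
    # simply never looked up again.
    return deps_key + ":" + expr_repr
```

Touching any module in the closure changes its signature, hence the dependencies key, hence every lookup key.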

@DmitryOlshansky

Member

DmitryOlshansky commented Oct 27, 2017

A preliminary patch to cache CTFE executions in a persistent store. This allows to reuse expensive CTFE computations across builds. Ironically processing huge struct literal is still quite costly but is a fight for another day.
It'd be nice to put such documentation into the code, as it is currently bereft of a clue :-)

I swear I thought I put this short description at the top of dcache.d but oh well... I'll do a few cleanups and get a proper heading for the module.

{
char[512] buf;
const(char)* home = getenv("HOME");
if (home == null) home = "/tmp";

@ibuclaw

ibuclaw Oct 27, 2017

Member

This should not depend on environment variables, and you should not hardcode /tmp.

Adding a command line switch to control the cache directory would be better. Or fall back to the build directory.

@DmitryOlshansky

DmitryOlshansky Oct 27, 2017

Member

I thought using a .something directory under HOME was super common for build tools of all sorts.

Regardless, the build directory is a fairly bad choice: in that case you cannot reuse the cache across different builds, and the point is that you can and likely should.

@ibuclaw

ibuclaw Oct 27, 2017

Member

If you have persistent storage across builds then you can just pass -flag=/dir in compilation.

Dub might be able to leverage this too, with a per-project cache.

@DmitryOlshansky

DmitryOlshansky Oct 27, 2017

Member

If you have persistent storage across builds then you can just pass -flag=/dir in compilation.

Or you could have a sensible default: use the global cache (keyed per compiler version string). There is no harm done.

Why must the most useful case be hidden behind a flag (one that you can easily get wrong)?

@ibuclaw

ibuclaw Oct 27, 2017

Member

The compiler shouldn't depend on environmental state. Control should be given to the build tool over where cached modules are written.

@DmitryOlshansky

DmitryOlshansky Oct 27, 2017

Member

And how does defaulting to the build directory accomplish that?

I'll redo the -cache switch to allow an optional path to the cache file. We can then discuss where the default should go.

-cache=mmap:some/path/here

Want control? Pass the path. Don't want control? Don't pass it.

Regardless, a shared cache is environmental state; in fact, it can live on a separate host. If you just want an option to pass it explicitly, I guess I'm fine with that.

@ibuclaw

ibuclaw Oct 27, 2017

Member

And how does defaulting to the build directory accomplish that?

dmd -c src/foo.d -o obj/foo.o -cache would generate an object file and a compilation cache for module foo under obj. The first compilation of foo.d generates the cache file, then subsequent compilations of the module reuse it.

@MartinNowak

MartinNowak Oct 29, 2017

Member

I'd opt for -cache to use a well-known per-user location, and -cache=path/to/whatever as an optional override.
Since we had that discussion in dub recently (dlang/dub#341), you want to use $XDG_CACHE_HOME/dmd as the cache directory (or $HOME/.cache/dmd as a fallback).
https://specifications.freedesktop.org/basedir-spec/basedir-spec-latest.html
Windows and OSX obviously have their own folder logic:
C:\Users\<user>\AppData\Local\dmd\cache and ~/Library/Caches/dmd, maybe.

@DmitryOlshansky

DmitryOlshansky Nov 7, 2017

Member

Where do I put the cache by default if all of the above fails? Or maybe I should fail?
XDG_CACHE_HOME is not defined on Ubuntu 16.04, for example.

@DmitryOlshansky

Member

DmitryOlshansky commented Oct 27, 2017

Whoa this is interesting. Please rebase. cc @WalterBright

Rebased.
Keep in mind it's a PoC, and until invalidation is implemented we should refrain from merging.

@MartinNowak

Member

MartinNowak commented Oct 27, 2017

Why has newCTFE stalled, @stefankoch?

An isolated bytecode interpreter is a lot more work than just an AST interpreter (which is already a lot of work). I tried to make this point initially, but hardly anyone wanted to hear it ;).
https://forum.dlang.org/post/nh9jfj$kai$1@digitalmars.com

Also, dmd's AST is full of weird special cases, lowered code, and incomplete semantic analysis by the time it hits CTFE, which made converting it into bytecode even more complex and hard to debug.
While Stefan's work is quite far along at the moment, it's not yet complete, and now that he has started working at Sociomantic (which is great), progress has become a lot slower. We'll soon discuss how we can finish it.

@MartinNowak

We have another need for AST hashing (https://issues.dlang.org/show_bug.cgi?id=14894#c10), maybe there is some useful overlap between the two?

{
lock();
scope(exit) unlock();
uint h = fnv1a(key);

@MartinNowak

MartinNowak Oct 27, 2017

Member

We have a better string hash in dmd already.

@nordlow

nordlow Oct 27, 2017

Contributor

Which hash algorithm is that?

@ibuclaw

ibuclaw Oct 28, 2017

Member

MurmurHash, IIRC.

@MartinNowak

MartinNowak Oct 29, 2017

Member

uint calcHash(const(ubyte)* data, size_t len) pure nothrow @nogc

FNV-1a produces quite a few collisions, and computing hashes one byte at a time is slow.
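For reference, 32-bit FNV-1a amounts to this (illustrative Python sketch): the byte-at-a-time loop is exactly what makes it slow compared to word-wise hashes such as MurmurHash.

```python
def fnv1a32(data: bytes) -> int:
    # FNV-1a: XOR each input byte into the state, then multiply
    # by the FNV prime, one byte per iteration.
    h = 0x811C9DC5                         # 32-bit FNV offset basis
    for b in data:
        h ^= b
        h = (h * 0x01000193) & 0xFFFFFFFF  # 32-bit FNV prime
    return h
```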

buf.writestring(";\n");
}
buf.writestring("return ");
buf.writestring(e.toChars());

@MartinNowak

MartinNowak Oct 27, 2017

Member

toChars was never intended as a parseable representation of the AST.
If you want to serialize something, you should look at hdrgen, or you might even need to implement something based on it.

@ibuclaw

ibuclaw Oct 27, 2017

Member

Or serialize the AST so the output is binary-like. Do I understand correctly that this writes plain text to the cache? What's there to prevent tampering?

@andralex

andralex Oct 27, 2017

Member

Well we're not really in the business of encrypting intermediate formats so no worries there.

@MartinNowak

Member

MartinNowak commented Oct 27, 2017

One of the more interesting questions: would a cache still be useful for a proper interpreter?
That could help decide the trade-off between turning this into a proper PR and finishing the interpreter, which isn't something that must be left to a single person.

* TODOs:
* - eviction policy (e.g. time based expiration)
* - sizing policy (is 128 Mb enough for everybody?)
* - better hash map algorithm (it's really dumb)

@MartinNowak

MartinNowak Oct 27, 2017

Member

I'm using a form of cuckoo hashing for memoize.
It takes care of eviction and collision resolution: https://github.com/dlang/phobos/blob/27072f8282fd60ef2af025280d217729d2b5770f/std/functional.d#L1044.
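The idea can be sketched like this (illustrative Python, not the Phobos code): each key hashes to two candidate slots, and inserting into a full pair simply evicts, so collision resolution and eviction come for free.

```python
import hashlib

class TwoWayCache:
    def __init__(self, nbuckets=64):
        self.nbuckets = nbuckets
        self.slots = [None] * nbuckets     # (key, value) pairs

    def _probes(self, key):
        # Two probe positions derived from one digest.
        d = hashlib.md5(key.encode()).digest()
        return (int.from_bytes(d[:8], "little") % self.nbuckets,
                int.from_bytes(d[8:], "little") % self.nbuckets)

    def get(self, key):
        for i in self._probes(key):
            e = self.slots[i]
            if e is not None and e[0] == key:
                return e[1]
        return None                        # miss: recompute, then put()

    def put(self, key, value):
        h1, h2 = self._probes(key)
        if self.slots[h1] is None or self.slots[h1][0] == key:
            self.slots[h1] = (key, value)
        elif self.slots[h2] is None or self.slots[h2][0] == key:
            self.slots[h2] = (key, value)
        else:
            self.slots[h1] = (key, value)  # both slots full: evict
```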

{
//TODO: fix linear scan for free block
Data* p = data;
uint blockSize = cast(uint)(Data.sizeof + key.length + 1 + value.length);

@MartinNowak

MartinNowak Oct 27, 2017

Member

Oh, with variable-sized entries things become a bit more complex.
Boost has allocators for shared memory, maybe something to draw from:
http://www.boost.org/doc/libs/1_45_0/doc/html/interprocess/allocators_containers.html#interprocess.allocators_containers.allocator_introduction.allocator_properties

You could also consider just appending new entries and occasionally running a compaction to purge overwritten/evicted values.
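The append-and-compact alternative is roughly (illustrative Python sketch):

```python
class AppendOnlyStore:
    def __init__(self):
        self.log = []                  # (key, value) in append order

    def put(self, key, value):
        self.log.append((key, value))  # never overwrite in place

    def get(self, key):
        # The newest entry for a key wins.
        for k, v in reversed(self.log):
            if k == key:
                return v
        return None

    def compact(self):
        # Occasionally purge overwritten/evicted values, keeping
        # only the latest entry per key (insertion order preserved).
        latest = {}
        for k, v in self.log:
            latest[k] = v
        self.log = list(latest.items())
```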

@MartinNowak MartinNowak changed the title from DCache - caching D compiler to CTFECache - caching D compiler Oct 27, 2017

@MartinNowak

Member

MartinNowak commented Oct 27, 2017

I renamed the PR as this doesn't address most of D's build performance issues with larger projects, but very specific CTFE bottlenecks.

@DmitryOlshansky

Member

DmitryOlshansky commented Oct 27, 2017

One of the more interesting questions, would a cache still be useful for a proper interpreter?

I firmly believe that no work is faster than some work.

In the current PoC I store source code and trick semantic analysis into treating it as a literal, bypassing import accessibility checks. It's kind of slow because it goes through the full mixin code path, including tokenization.

If instead I stored it in some binary format, say protobuf or flatbuffers, it would be super fast to deserialize. It's really hard to compute a dataset faster than protobuf can deserialize it.

Lastly: currently it won't cache anything that is computed in less than 100ms. I bet I can deserialize at > 200 megabytes/sec on a decent CPU.

@DmitryOlshansky

Member

DmitryOlshansky commented Oct 27, 2017

You could also consider to just append new entries and sometimes run a compaction to purge overwritten/evicted values

Indeed, I could delay a linear pass that coalesces free blocks until I run out of options to allocate an entry. The problem is searching for a free block; maybe simple bucketing can work.

@DmitryOlshansky

Member

DmitryOlshansky commented Oct 27, 2017

toChars was never intended as parseable representation of AST.

It calls hdrGen. Might be accidental code reuse though.

@DmitryOlshansky

Member

DmitryOlshansky commented Oct 27, 2017

I renamed the PR as this doesn't address most of D's build performance issues with larger projects, but very specific CTFE bottlenecks.

Fair enough, but I see this as a step towards true parallel compilation, where it wouldn't matter whether you run the compiler one file at a time or on all files at once.
Each compiler instance would cooperate with the others via a shared cache, which may even be, say, a memcached running on a corporate LAN shared across all employees. And since they all routinely build the same stuff, a single global cache makes perfect sense.

The beauty of the approach is that any distributed key-value store will work, it doesn't require significant complexity in the compiler to parallelize compilation, and lastly no work on behalf of build tools is required.

@andralex

Member

andralex commented Oct 27, 2017

@DmitryOlshansky why shouldn't the cache reside by default in the same directory as object files? I fear we're getting sucked quickly into a debate on a little detail.

Regarding storing the AST using toChars, it's a good decision at least for the time being. It will work wonders for debugging. If the output of toChars is not adequate for parsing, it's less work to make it so than to develop yet another format.

Keep it simple please.

@ibuclaw

Member

ibuclaw commented Oct 28, 2017

Regarding storing the AST using toChars, it's a good decision at least for the time being. It will work wonders for debugging.

I'd prefer a line checksum to be included then, to at least have some measure against tampering with cache files.

@ibuclaw

Member

ibuclaw commented Oct 28, 2017

If somebody can tamper with the cache, what stops him (or her) from tampering with your sources and VCS metadata?

Git already has protections against that.

If the cache were to differ from the original source code, that would not be trivial to debug.

@andralex

Member

andralex commented Oct 28, 2017

What's with all this discussion about tampering all of a sudden? How about tampering with object files then? Sources? Compiler? Runtime libraries? Please stay on topic. We should learn to do proper reviews without derailing into a huge discussion about nothing of importance. Thanks.

@nordlow

Contributor

nordlow commented Nov 7, 2017

How difficult would it be to extend the caching to include template instantiations?

@DmitryOlshansky

Member

DmitryOlshansky commented Nov 7, 2017

How difficult would it be to extend the caching to include template instantiations?

I haven't tried that, but looking at the compiler internals, TemplateInstance is an octopus with all manner of links to AST nodes. It becomes not only a question of how to serialize such a graph but also of how to merge things loaded from the cache with instances computed on the fly.

I'd try it eventually, once I'm fairly certain I've got CTFE right.

@andralex

Member

andralex commented Nov 7, 2017

I share Stefan's distaste for the whole idea

Would be great to get more background on that. Caching work is a time-honored technique and this seems a rather straightforward application of it. Thanks!

@@ -9009,16 +9012,16 @@ private extern (C++) final class ExpressionSemanticVisitor : Visitor
// Special handling for array comparisons
if (!(t1.ty == Tarray && t2.ty == Tarray && needsDirectEq(t1, t2)))
{
if (!arrayTypeCompatible(exp.loc, exp.e1.type, exp.e2.type))

@don-clugston-sociomantic

don-clugston-sociomantic Nov 7, 2017

Contributor

Looks as though the formatting got mucked up at this point.

@DmitryOlshansky

DmitryOlshansky Nov 7, 2017

Member

Indeed, also these lines shouldn't be touched.

@don-clugston-sociomantic

Contributor

don-clugston-sociomantic commented Nov 7, 2017

I share Stefan's distaste for the whole idea
Would be great to get more background on that. Caching work is a time-honored technique and this seems a rather straightforward application of it. Thanks!

I have two concerns.

My first observation is that compilation time is not dominated by CPU time; it is almost entirely a result of excessive memory allocation. DMD already has a severe problem with memory usage; we have some apps that can barely be compiled on some of our machines. I fear that adding a cache would make the situation even worse.
And of course it does nothing to fix the problem of excessive memory usage during the initial compilation.

The second concern is that I have doubts about ensuring that the cache is really correct. Any change in any scope that's visible from any code being CTFE'd could change the results. Now maybe the algorithm described here actually works, but it seems a bit too good to be true.
I don't trust toChars, for example. And what happens with recursive structures? Do they actually work?

@DmitryOlshansky

Member

DmitryOlshansky commented Nov 8, 2017

My first observation is that the compilation time is not dominated by CPU time, it is almost entirely a result of excessive memory allocation.

My estimation is as follows.

The current CTFE takes from ~500ms to a few seconds on some simple regexes.
Mixing in a literal from the cache clocks in at ~1-2ms.
Even if newCTFE is cumulatively, say, 20x faster, we are looking at 1ms vs 25-100+ms.

If CTFE is slow because it allocates like crazy, a cache of literals is a drop in the bucket.

DMD already has a severe problem with memory usage, we have some apps that can barely be compiled on some of our machines. I fear that adding a cache would make the situation even worse.

Well at the moment it's a fixed-size 128MB memory-mapped file. Since we don't pay for pages that are not touched, the cost is zilch. 128 megs is super generous.

In my observations, even the large Regex struct literals, full of ulong lookup tables for Unicode, are ~25K long. It took ~1 megabyte for a third of the test suite for std.regex, and not every regex was cached, only the ones taking > 100ms.

With that, on 32-bit systems with scarce virtual address space, we could easily go down to, say, 32 megs if that is a concern.
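The "don't pay for untouched pages" point is easy to demonstrate with a sparse file (illustrative Python, not the dmd code): the mapping has a fixed apparent size, but pages only get materialized when written.

```python
import mmap, os, tempfile

def make_cache_mapping(size):
    # Create a sparse backing file of `size` bytes and map it.
    fd, path = tempfile.mkstemp(prefix="dcache-")
    os.ftruncate(fd, size)        # apparent size; no blocks allocated yet
    return mmap.mmap(fd, size), fd, path

m, fd, path = make_cache_mapping(1 << 20)  # 1 MB for the demo
m[0:5] = b"hello"                          # writing materializes one page
```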

And of course it does nothing to fix the problem of excessive memory usage during the initial compilation.

That is true. However, the "initial" compilation can reuse, e.g., standard library CTFE calls computed in other unrelated projects.

The second concern is that I have doubts about ensuring that the cache is really correct. Any change in any scope that's visible from any code being CTFEd, could change the results.

Yeah, that's why I try to limit it to global scopes.
Overall the idea is to limit it to things it can prove will work reliably.

Now maybe the algorithm described here actually works. But it seems a bit too good to be true.

Which is exactly the kind of thing I'd need help with the most. If something is hard, it doesn't mean we shouldn't try it. After all, CTFE was born in a similar fashion: to do something that wasn't even remotely possible at some point.

I don't trust toChars, for example. And what happens with recursive structures? Do they actually work?

I'd bet not, until we have an AST serializer that supports cyclic links; we might do one next.

@stefan-koch-sociomantic

stefan-koch-sociomantic commented Nov 12, 2017

@DmitryOlshansky Ensuring the cache is correct pretty much comes down to hashing the AST properly.
Which is unfortunately almost impossible, since the AST at this stage has already had all static ifs and versions stripped out.
However, you could do a best effort by hashing the textual representation of every function body in your call graph (I'd recommend SHA-256) and keeping the function body hashes immutable.

For usage in CTFE caching you'd walk the call graph and mix the corresponding hashes in an order-independent way.

I reckon that will give you 98% correct behavior.
The remaining failing cases, however, will result in silent corruption of the binary.
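The order-independent mixing step might look like this (illustrative Python; XOR of per-body SHA-256 digests, so the traversal order of the call graph doesn't matter):

```python
import hashlib

def body_hash(body: str) -> int:
    # Immutable hash of one function body's textual representation.
    return int.from_bytes(hashlib.sha256(body.encode()).digest(), "big")

def call_graph_hash(bodies) -> int:
    # XOR is commutative and associative, so the combined hash is
    # independent of the order the call graph is walked in.
    h = 0
    for b in bodies:
        h ^= body_hash(b)
    return h
```

(A real implementation would also need to guard against duplicate bodies cancelling out under XOR.)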

@DmitryOlshansky

Member

DmitryOlshansky commented Nov 13, 2017

@DmitryOlshansky Ensuring the cache is correct pretty much comes down to hashing the AST properly.
Which is unfortunately almost impossible, since the AST at this stage has already had all static ifs and versions stripped out.
However, you could do a best effort by hashing the textual representation of every function body in your call graph (I'd recommend SHA-256) and keeping the function body hashes immutable.

I do not cache nested calls, and there is little point in doing so if you can just cache the top-level call. Therefore it's enough to consider all the types of arguments in the top-level expression plus the function that is called. Instead of taking the bodies of said types and functions, we could go for the more coarse-grained solution of modules.

The rest of your argument stands, though I'd rather stay at 100% and just avoid caching tricky cases.

@ibuclaw

Member

ibuclaw commented Nov 13, 2017

My first observation is that the compilation time is not dominated by CPU time, it is almost entirely a result of excessive memory allocation. DMD already has a severe problem with memory usage, we have some apps that can barely be compiled on some of our machines.

It would not be the first time that I have wondered what would happen if all literal nodes were allocated via a cache pool (i.e. IntegerExp's value becomes a pointer to IntegerLiteral), as is done for Identifier nodes.

@don-clugston-sociomantic

Contributor

don-clugston-sociomantic commented Nov 13, 2017

Would your caching work in the case where you have a hidden dependency, eg:

int foo(string file)(int q)
{
    mixin("import " ~ file ~ ";");
    return something_magical + q;
}

If that file changes, does it discard the cache? If it can cope with that sort of thing, it is probably OK.

@DmitryOlshansky

Member

DmitryOlshansky commented Nov 13, 2017

If the file file changes, does it discard the cache? If it can cope with that sort of thing it is probably OK.

It will look at all arguments and the modules they come from. So it will see the import as it appears after the mixin is applied.

I see that I need to (transitively) take the modules of the argument types as well as the modules of the arguments themselves (if they are not literals).

@don-clugston-sociomantic

Contributor

don-clugston-sociomantic commented Nov 13, 2017

I think you just need the function arguments (not the template arguments, since those will apply to a different function), together with all of the modules which are visible inside the function.
So for your purposes it's probably no different from the simpler case:

int foo(int q)
{
    import some_file;
    return something_magical + q;
}

That is, you do need to take into account imports at any enclosing scope. Even if an import is not currently used in the function, I think a change to it could still change the result.

Note that you can have horrible things like:
mixin("import " ~ (some_bool ? "nasty;" : "nastier;"));
So you can hide these dependencies very effectively.

Anyway, that's a few test cases for you.

@MartinNowak

Member

MartinNowak commented Dec 1, 2017

@MartinNowak Can you please show me how to use IgnoreSymbolVisibility? I want to make a scope that will bypass all lookups including to symbols in parent scopes.

See the example referenced in #7239 (comment).

@@ -0,0 +1,381 @@
// An implementation of MD5 message digest algorithm
// Copied from Phobos std.digest.md5
module ddmd.root.md5;

@MartinNowak

MartinNowak Dec 1, 2017

Member

Why do we need a cryptographic hash when the keys are strings and we throw away all of those digest bits anyhow?
Just using ddmd.root.hash.calcHash is fine here.

@DmitryOlshansky

DmitryOlshansky Dec 4, 2017

Member

The key is a rather lengthy string representation of the arguments, so MD5 is a nice way to shorten it without losing the uniqueness requirement. Also, we could use the lower 64 bits for the first probe and the higher 64 bits for the second probe in cuckoo hashing.
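For illustration, splitting one 16-byte digest into the two probe indices might look like this (a hypothetical sketch; `cuckooProbes` and its signature are made up, not part of the patch):

```d
// Hypothetical sketch: derive both cuckoo-hashing probe slots from a
// single 16-byte digest (e.g. MD5), with no extra hashing passes.
size_t[2] cuckooProbes(const ubyte[16] digest, size_t nBuckets)
{
    ulong lo, hi;
    foreach (i; 0 .. 8)
    {
        lo |= cast(ulong) digest[i] << (8 * i);     // low 64 bits: first probe
        hi |= cast(ulong) digest[i + 8] << (8 * i); // high 64 bits: second probe
    }
    return [cast(size_t)(lo % nBuckets), cast(size_t)(hi % nBuckets)];
}
```

Since both probes come from the same digest, the key is only hashed once per lookup.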

static uint first32bits(const(ubyte)[] key)
{
    return *cast(uint*) key.ptr;
}

@MartinNowak

MartinNowak Dec 1, 2017

Member

Please avoid non-portable unaligned reads, just use p[0] | p[1] << 8 | p[2] << 16 | p[3] << 24, it will be optimized to a 32-bit read when the architecture has hardware support for unaligned reads.
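A portable version along those lines might look like this (a sketch following the suggested expression, not the patch's final code):

```d
// Portable little-endian 32-bit read: no unaligned pointer cast.
// Optimizing compilers typically fold the byte ops into a single
// load on architectures that support unaligned reads.
uint first32bits(const(ubyte)[] key)
{
    assert(key.length >= 4);
    return key[0] | key[1] << 8 | key[2] << 16 | key[3] << 24;
}
```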

@MartinNowak

Member

MartinNowak commented Dec 1, 2017

So we have two proposals here:

  • a focused CTFE literal cache

This is fairly straightforward: literals are numbers, strings, hashmaps, and compositions thereof (structs). They are the outcome of applying functions to literals.

H(result) = H(function) || H(args) || H(template args)

Hashing the source file of a function and all imports visible in the function is a conservative and correct approach to computing a function hash.

H(function) = H(source) || H(all imports in function scope)

Hashing arguments is trivially just a hash of their literal representations.

H(args) = args.map!(a => a.toChars.hash).combine

Hashing template arguments uses either a literal hash (for value arguments) or a symbol/type hash.

H(template value arg) = literal.toChars.hash
H(template symbol/type) = H(source) || H(all imports in symbol/type scope)

When the symbol is an uninstantiated template (or a type with template methods), it's tricky to get the used imports, but you could conservatively take the hashes of all Imports inside the template declaration (before instantiation).
Maybe there is a better approach for this case. You could also just skip those entries for now.

Since we already read all imports, it's easy and cheap to compute a hash while reading them from disk. Sources of a project and its dependencies rarely change, so just using file hashes should be a decent enough choice. Finer-grained hashing would be more complex to implement and leaves a lot more room for bugs.

  • a long-term proposal for how to cache AST and associated codegen

The latter obviously requires a lot of work to choose a proper caching strategy, correctly hash all dependencies, and rework the build (compiler invocation) and linking approaches (a linker plugin?).
The implementation complexity might be comparable to precompiled headers.
But thoroughly thought through, designed, and implemented, it sounds like the most serious approach to reducing build times and leveraging parallel and incremental rebuilds.
I'm not sure anyone has enough resources to pull off this project, though. It would be great to break it down and see whether some parts could be done separately, while already improving the current situation.
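The first proposal's key scheme could be sketched roughly as follows (a hypothetical sketch, not the patch's code: names like `ctfeCacheKey` are made up, MD5 is used as in the patch, and a real implementation would feed in digests of sources/ASTs rather than raw strings):

```d
import std.bitmanip : nativeToLittleEndian;
import std.digest.md5 : MD5;

// Hypothetical sketch of the scheme above:
//   H(result)   = H(function) || H(args) || H(template args)
//   H(function) = H(source)   || H(imports in function scope)
ubyte[16] ctfeCacheKey(string funcSource, string[] importSources,
                       string[] argLiterals, string[] templateArgs)
{
    MD5 md;
    md.start();
    // length-prefix every item so that "ab","c" and "a","bc"
    // cannot collide via plain concatenation
    void putItem(const(char)[] s)
    {
        auto len = nativeToLittleEndian(cast(ulong) s.length);
        md.put(len[]);
        md.put(cast(const(ubyte)[]) s);
    }
    putItem(funcSource);
    foreach (s; importSources) putItem(s);
    foreach (a; argLiterals) putItem(a);
    foreach (t; templateArgs) putItem(t);
    return md.finish();
}
```

The key is deterministic for identical inputs, and any change to the function source, a visible import, or an argument changes the digest and thus misses the cache.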

@DmitryOlshansky

Member

DmitryOlshansky commented Dec 4, 2017

So we have two proposals here:

a focused CTFE literal cache

Indeed, that is what I'm doing here. I haven't got much spare time to complete it though.

Since we already do import all imports, it's easy and cheap to compute a hash while reading them from disk. Sources of a project and it's dependencies rarely change, so just using file hashes should be a decent enough choice. Finer grained hashing would be more complex to implement and leaves a lot more room for bugs.

I did just that, making a pass over each file in the Parser constructor. It's non-negligible overhead though: I got ~200ms on a simple std.regex usage example. It does pull in half of Phobos, though.

The end result is that:

  • on the first run we get the normal compile time of a few seconds + 200ms of hashing overhead
  • on successive runs we get around 1 second + 200ms of hashing overhead, which is strictly faster

I believe a better strategy is to hash as we tokenize the file, i.e. update the hash on each token. This ensures a single pass over memory. Sadly, I ran out of time to test it out.
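Such an incremental, per-token hash update could be sketched like this (hypothetical names; the real lexer integration would differ):

```d
import std.digest.md5 : MD5;

// Hypothetical sketch: fold each token's bytes into the module hash
// as the lexer produces it, so the source buffer is only traversed
// once, while it is still hot in cache.
struct HashingLexer
{
    private MD5 hash;

    void start() { hash.start(); }

    // called by the lexer for each token's slice of the source buffer
    void onToken(const(char)[] tokenText)
    {
        hash.put(cast(const(ubyte)[]) tokenText);
    }

    ubyte[16] moduleHash() { return hash.finish(); }
}
```

A side effect of hashing token text rather than raw bytes is that whitespace and comment changes would not invalidate the cache, which may or may not be desirable.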

@MartinNowak

Member

MartinNowak commented Dec 5, 2017

I believe a better strategy is to hash as we tokenize the file, i.e. update the hash on each token. This ensures a single pass over memory. Sadly, I ran out of time to test it out.

Yes, should be done while the data resides in L1 cache.
BTW, I get 15-20ms for sha1sum on all phobos .d files, so it should be quite OK.
Also, hashsums themselves can be cached using modtimes.
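Caching hashsums by modtime could be sketched like this (a hypothetical `fileHash` helper, not part of the patch; SHA1 via std.digest):

```d
import std.datetime.systime : SysTime;
import std.digest.sha : sha1Of;
import std.file : read, timeLastModified;

struct HashEntry { SysTime mtime; ubyte[20] digest; }
HashEntry[string] hashCache;

// Hypothetical sketch: reuse a file's digest as long as its mtime is
// unchanged, so unmodified sources are never re-read or re-hashed.
ubyte[20] fileHash(string path)
{
    immutable mtime = timeLastModified(path);
    if (auto e = path in hashCache)
        if (e.mtime == mtime)
            return e.digest;    // mtime unchanged: cached digest is valid
    auto digest = sha1Of(cast(const(ubyte)[]) read(path));
    hashCache[path] = HashEntry(mtime, digest);
    return digest;
}
```

The usual caveat applies: mtime granularity and clock skew can miss an edit in rare cases, so a tool may want a size check or a content check as a fallback.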

Dmitry Olshansky
Revival of DCache, now with SHA1 as hash function + dependencies
SHA1 is actually the only hash that has a fast assembly implementation
in our std.digest.* thingy. Speed is on par with OpenSSL; the rest,
including the simpler and insecure MD5, are a few times slower.
@DmitryOlshansky

Member

DmitryOlshansky commented Jan 29, 2018

Revived, now with dependency tracking. Using scopes didn't quite pan out due to them often being zero for imported symbols.

Still, the ugly cachedSemantics flag is there; I can't seem to get the IgnoreVisibility flag to work. Will look into that next.
