-
Replying to @afbjorklund's comment in #213 (comment):
I don't see why that would be an issue if a more flexible file format is used (similar to what you sketched further down in #58 (comment))?
I don't think that it will be significantly harder. On the other hand, cache deduplication is very low priority for me since it adds complexity.
I'd be happy to hear more about your ideas here, but spontaneously I'm a bit sceptical. 🙂 Do you mean the actual Git .pack format or just something similar? And should it then be a per-cache file, a per-level-1 file or a per-level-2 file, etc.? How should cleanup be done?
-
What I meant to say was that the number of such new file formats was driving this exact change... I meant the concept of the git pack files, not necessarily the format. Maybe we can use libarchive?
-
Oh, but then you surely mean what to use for the "one single file per result" approach? When you mentioned Git's pack files, I naturally thought you meant that all results in the cache should be put in one or a small number of pack files, since that is how pack files work in Git. Regarding libarchive: Yes, perhaps!
-
I think these might be slightly different concepts, but both are good ideas. Combining the current several files into one file per hash is good, for all the reasons listed. Combining several hashes into a pack file also has benefits, but needs locking and other complexity... If I recall correctly, git sometimes writes "loose objects" and then combines them?
-
Should also note that we discovered that a side effect / bonus of compression was checksumming... So that is available now, but including it in the combined format would make it available to everyone? There is also no "fsck" option available, anyway. Suppose you could just call
-
Exactly, I'm not sure that combining results in files like that will be a good fit for ccache:
Perhaps it would be possible to do if we have lots of time, but given time constraints I'm inclined to just not think more about it since it seems too hard.
Yes, it writes all new objects to a directory structure similar to ccache's cache directory. Then when a threshold is reached, the loose objects are combined into a pack, which may take a while. And when there are too many packs, the packs will be consolidated into larger packs.
Yes, I actually think that we should use lz4 with its checksumming by default (or always) for the fat files.
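For illustration, a minimal sketch of what enabling lz4's built-in content checksum could look like via the LZ4 frame API (a hypothetical helper, not ccache code; error handling kept minimal):

```c
// Compress a buffer into an LZ4 frame with the content checksum enabled, so
// that decompression with LZ4F_decompress() verifies the data's integrity.
#include <lz4frame.h>
#include <stdlib.h>
#include <string.h>

size_t compress_with_checksum(const void* src, size_t src_size, void** dst_out)
{
    LZ4F_preferences_t prefs;
    memset(&prefs, 0, sizeof(prefs));
    prefs.frameInfo.contentChecksumFlag = LZ4F_contentChecksumEnabled;

    size_t bound = LZ4F_compressFrameBound(src_size, &prefs);
    void* dst = malloc(bound);
    if (!dst) {
        return 0;
    }
    size_t n = LZ4F_compressFrame(dst, bound, src, src_size, &prefs);
    if (LZ4F_isError(n)) {
        free(dst);
        return 0;
    }
    *dst_out = dst;
    return n; // compressed size in bytes
}
```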
-
(Sorry if I sound negative. Perhaps there is a relatively easy solution that I just don't see at the moment?)
-
Combining individual files for each result into one sounds like a good start, and easier to do too... I would prefer a different name than "fat files", because I normally associate that with fat objects. And it might be offensive... So better use another name?
-
Could maybe go with the simple format*, and then just do compression on the file level (like before)?

```c
uint32_t suffix_len; // network endian
char suffix[suffix_len];
uint32_t len;        // network endian
char data[len];
```

Assuming that the filename still has the hash. Or maybe add a default suffix to it, to keep DOS happy. `$md4hash-$length`:
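A minimal sketch of appending one entry in that layout (hypothetical helper name, not the actual prototype; lengths are written in network byte order and error handling is omitted):

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Append one suffix_len/suffix/len/data entry to an open result file.
static void write_entry(FILE* f, const char* suffix, const void* data,
                        uint32_t len)
{
    uint32_t suffix_len = (uint32_t)strlen(suffix);
    uint32_t be = htonl(suffix_len);
    fwrite(&be, sizeof(be), 1, f);
    fwrite(suffix, 1, suffix_len, f);
    be = htonl(len);
    fwrite(&be, sizeof(be), 1, f);
    fwrite(data, 1, len, f);
}
```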
-
Brain-storming some potential names: accumulated, aggregated, bunch, bundle, combined, compact, lump, pack (🙂), pile. On the other hand, in the code the concept perhaps could be just "result" and in the filesystem it could be `.result`.
-
Yes, that's in line with my thoughts as well. If we want to keep supporting hardlinking, the header will also need some info saying whether the file data is stored in a separate file or not.
-
Started to work on this feature... There are two additions to start with:
Unlike the memcached format, we don't want to require all the data to be first read into memory.
-
Checksumming and (optional) compression will still be done on the individual file level.
-
Thanks for starting to look into this! I would much appreciate a more detailed explanation of your design and plans when you have time to do so. Should we maybe spin up a
Here are some open questions that I think need to be decided before an implementation can be finalized or maybe even started for real:
I already have opinions on the above of course, but I'm curious to know what you think since you started working on the feature. Comments on what you wrote so far:
Why would the aggregated file need to store paths (to individual result files, I assume)?
Great, yes, it needs to be streamable.
I didn't quite follow you here. Could you expand on when the code would need to seek into the file?
Why do you prefer that instead of compressing the whole file?
-
Basically what I wanted was something that would drop into the current framework... As for the individual files, those are for aggregating the inputs. As in:

```c
add_cache_file(cache, output_obj, ".o");
```

The offsets were collected when preparing to read the files from the aggregated file.

```c
get_cache_file(cache, output_obj, ".o");
```

Compressing individual files is to stay compatible, and to avoid temporary storage.
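To illustrate the offset idea, here is a hedged sketch (hypothetical types and names, assuming the simple per-entry layout sketched earlier and no preceding header): the reader scans the entry headers once, records where each suffix's data starts, and can later seek straight to it.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

struct entry_offset {
    char suffix[16];  // e.g. ".o", ".stderr"
    long offset;      // file position where the entry's data starts
    uint32_t size;    // length of the data
};

// Scan the aggregated file and record the offset of each entry's data,
// skipping over the data itself so only the headers are read.
static int scan_offsets(FILE* f, struct entry_offset* table, int max_entries)
{
    int n = 0;
    uint32_t be;
    while (n < max_entries && fread(&be, sizeof(be), 1, f) == 1) {
        uint32_t suffix_len = ntohl(be);
        if (suffix_len >= sizeof(table[n].suffix)
            || fread(table[n].suffix, 1, suffix_len, f) != suffix_len
            || fread(&be, sizeof(be), 1, f) != 1) {
            return -1; // malformed or truncated entry
        }
        table[n].suffix[suffix_len] = '\0';
        table[n].size = ntohl(be);
        table[n].offset = ftell(f);
        fseek(f, (long)table[n].size, SEEK_CUR); // jump to the next entry
        n++;
    }
    return n;
}
```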
-
Whoa there, hold your horses... OK, so you're talking about needs that future backends might have. I don't see why we should embed bookkeeping information like that into the result files. In my mind, that should be backend-specific metadata; for a plain filesystem: mtime and size, for (say) Redis: standard LRU eviction configuration, etc. Rationale for not keeping a "recently used timestamp" inside the result object: It will be a performance killer for cleanup. Such information needs to be stored in a more efficient manner.
When you wrote '"simple" 0x31d6cfe0d16ae931b73c59d7e0c089c00000 (network byte order)', I naturally assumed you meant a binary representation because you wrote "0x" and "network byte order". That could make sense for backends that can use any binary string as the key (but you can't use that for filesystem names, of course, since bytes 0x00 and 0x2f aren't allowed). But now it seems you mean going from one human-readable format to another human-readable format... Or not, since you posted PR #399? OK, so by "the same system", you maybe meant "binary where possible in the future, but a new human-readable format for file systems now"? You continue to confuse me... 🙂
I'm not sure if I understand what you mean by this. If you mean that refactoring needs to be done when introducing a generic backend API, yes, sure. If you mean something else, please expand.

I think that there are a bit too many topics in the discussion mix, many of which depend on each other. We need to come up with a way to divide the work into parts that can be discussed relatively independently. Another problem is maybe that I actually already have thought quite a lot about many of the topics in the discussion during the last two years or so, but haven't had time to write down those thoughts and conclusions.

My suggestion is: let's focus on the result file content in this issue. How lookup based on a key (currently hash+size) in the current filesystem backend should work, how a future generic backend API could look, etc., are all very interesting topics, but I don't think they need much discussion when it comes to designing the result file content format. Sounds reasonable?
-
Just meant that it needs to go through the storage backend, so it can decide. It's not really inside the result object, just kept next to it (like you mention). Right now it's in the file stat.
It is "simple" because it is one key rather than two (hash + size), but it is more complex since it is binary - the representation I wrote was made up - just easier to type than
It was not going to be backwards compatible, especially not if changing compression/checksum.
I don't really care about discoverability or file storage, those are already covered by today's system...
It moves keys from the current 42-53 bytes down to 20 bytes, and avoids the dynamic allocation.
Storing the manifest files is easy, we already did that for the memcached backend. But it is also possible to write them directly to memory without going through the file API, should it be needed (later on).
Yeah, I would move the storage backends and such to a different discussion.
Most of my thoughts are in the code, but most of the code is not being merged.
Sure thing. I already posted a code skeleton, which was mostly based on the .manifest format. I can rename those .cache files to .result, and then remove the individual storage/retrieval of files. I was mostly interested in improving the memcached code, and this story is just 1/5 (says #58):
Adding a compression type to this particular story is not needed either; it can continue with gzip.
-
Great! Then it sounds like we're on the same page regarding result file content. Let's discuss the backend API later.
Yes, changing anything that affects the key calculation or key format obviously invalidates pre-existing results stored in the cache. That's nothing strange and has happened many times over the years, for instance when adding new hash material. But what I'm talking about is backward incompatibility when it comes to cleanup. Right now, it works OK enough to run different incompatible versions of ccache against the same cache directory since a) statistics counters are kept in the same place (and with the same format and semantics), and b) the cache directory structure is similar enough. If we change this, then the different ccache versions won't see each other's files and size statistics anymore. We then have to decide if we should live with this likely confusing situation or do something about it. In any case, we need to make a conscious decision. Makes sense?
I don't understand what you mean. Isn't changes to today's system what we are discussing?
I appreciate that you want to make things happen so that a memcached backend can be realized. Sorry if I'm applying resistance, but I strongly feel that we need to do things in a certain way and order for me to be able to keep up with the development and continue maintaining the code base. I think I need to take some time to write down a more concrete plan of what I would like to see. The bullets you refer to in #58 (comment) are mostly handwaving.
-
What I mean is that we already have had that for a while; I want to try something different (i.e. make the cache internals opaque to the user, and then optionally expose them using helpers where needed). The compatibility (with clean and such) is a problem, but maybe it could be "fixed" by moving into a separate directory such as

Updated the sample code, which is just forking the code to see which parts are affected by this. Two of the current optimizations in the source code were affected by the introduced abstraction:
As per above, this leaves the aggregated cached_result files using mtime and unlink like before.
-
Also, the current tweak doesn't address the feature where we have two caches: local + remote. It is just doing the task of combining multiple files into one, and still using gzip compression (only).
-
Passes the "base" tests now, after removing some of the checks on "number of files in cache". Also added a $ ./ccache --dump-result /home/anders/.ccache/4/d/0cdcf7300b6d09c0d8019ef12b9b9c-259.result
Magic: cCcC
File paths (2):
0: .stderr (0.1 kB)
1: .o (1.0 kB) |
Beta Was this translation helpful? Give feedback.
-
@afbjorklund: I have had a quick look at the on-disk format you prototyped and I think it looks good! One thing though is that I think that we should start using 64-bit file sizes.
-
Done in #404. Feel free to comment.
-
Those are some really big object files... They would have to be > 4 GB?
-
Yep. Or source or header files. I don't expect that we will see such large files, but then again, who knows what strange code people will compile in the future? 🙂 Better safe than sorry. The other option is of course to handle the situation gracefully (i.e. call
-
If we're expanding the hash anyway, expanding the size isn't very costly, no.
-
One more thing regarding the on-disk format: I suggest adding a version field to the header. This way the dump function can know if it's a supported format or not.
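As a rough sketch of how such a check could work (hypothetical constants, mirroring the magic shown by `--dump-result` above; not the actual implementation):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define RESULT_MAGIC   "cCcC" // magic shown by --dump-result above
#define RESULT_VERSION 1      // hypothetical current format version

// Read the magic and version from the start of a result file and reject
// anything the dumper doesn't understand.
static int check_header(FILE* f)
{
    char magic[4];
    uint8_t version;
    if (fread(magic, 1, sizeof(magic), f) != sizeof(magic)
        || memcmp(magic, RESULT_MAGIC, sizeof(magic)) != 0) {
        return -1; // not a result file
    }
    if (fread(&version, 1, 1, f) != 1 || version != RESULT_VERSION) {
        return -1; // unsupported version
    }
    return 0;
}
```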
-
OK, will do (similar to the manifest). Also need to fix the file copy, to not do one byte at a time...
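For reference, a minimal sketch of a buffered copy loop (generic C, not the actual patch) that moves data in fixed-size chunks instead of one byte at a time:

```c
#include <stdio.h>

// Copy the remaining contents of `in` to `out` using a fixed-size buffer.
static int copy_stream(FILE* in, FILE* out)
{
    char buf[65536];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            return -1; // write error
        }
    }
    return ferror(in) ? -1 : 0;
}
```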
-
Will squash the current WIP branch, and make a proper PR for better discussion of details.
-
See #411 for the actual implementation task.
-
Idea: We could introduce "aggregated result files" (other name suggestions welcome) that store all of a cached result in one file instead of in one to seven files as done on master at the time of writing.
This idea has previously been mentioned on the mailing list and in #58 (comment) and #213 (comment).
Currently, a cached result is stored in the cache in the form `$md4hash-$length.o`, `$md4hash-$length.stderr`, `$md4hash-$length.d`, `$md4hash-$length.gcno`, `$md4hash-$length.su`, `$md4hash-$length.dia` and `$md4hash-$length.dwo`. This has a couple of advantages. There are advantages to storing the result in a single file, though:

- Fewer `stat`s need to be made for the cleanup algorithm's LRU behavior, etc.
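To make the idea concrete, here is a hedged sketch of how those per-suffix files could be laid out inside one aggregated result file, combining the magic/version header and per-entry format discussed above (field names and sizes are illustrative only, not a decided format):

```c
// Hypothetical on-disk layout of one aggregated $md4hash-$length.result file:
//
//   char     magic[4];      // e.g. "cCcC"
//   uint8_t  version;       // format version, for dump/fsck-style tooling
//   uint32_t n_entries;     // number of embedded files (network endian)
//   repeated n_entries times:
//     uint32_t suffix_len;            // network endian
//     char     suffix[suffix_len];    // ".o", ".stderr", ".d", ...
//     uint64_t data_len;              // network endian; 64-bit per #404
//     char     data[data_len];        // the embedded file's contents
```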