-
Replying to @afbjorklund's comment in #213 (comment):
I don't see why that would be an issue if a more flexible file format is used (similar to what you sketched further down in #58 (comment))?
I don't think that it will be significantly harder. On the other hand, cache deduplication is very low priority for me since it adds complexity.
I'd be happy to hear more about your ideas here, but spontaneously I'm a bit sceptical. 🙂 Do you mean the actual Git .pack format or just something similar? And should it then be a per-cache file, a per-level-1 file or a per-level-2 file, etc.? How should cleanup be done?
-
What I meant to say was that the number of such new file formats was driving this exact change... I meant the concept of the git pack files, not necessarily the format. Maybe we can use libarchive?
-
Oh, but then you surely mean what to use for the "one single file per result" approach? When you mentioned Git's pack files, I naturally thought you meant that all results in the cache should be put in one or a small number of pack files, since that is how pack files work in Git. Regarding libarchive: Yes, perhaps!
-
I think these might be slightly different concepts, but both are good ideas. Combining the current several files into one file per hash is good, for all the reasons listed. Combining several hashes into a pack file also has benefits, but needs locking and other complexity... If I recall correctly, git sometimes writes "loose objects" and then combines them?
-
Should also note that we discovered that a side effect / bonus of compression was checksumming... So that is available now, but including it in the combined format would make it available to everyone? There is also no "fsck" option available, anyway. Suppose you could just call
-
Exactly, I'm not sure that combining results in files like that will be a good fit for ccache:
Perhaps it would be possible to do if we have lots of time, but given time constraints I'm inclined to just not think more about it since it seems too hard.
Yes, it writes all new objects to a directory structure similar to ccache's cache directory. Then when a threshold is reached, the loose objects are combined into a pack, which may take a while. And when there are too many packs, the packs will be consolidated into larger packs.
Yes, I actually think that we should use lz4 with its checksumming by default (or always) for the fat files.
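For illustration, a minimal sketch of what enabling lz4's built-in content checksum could look like via the LZ4 frame API (a hypothetical helper, not ccache code; error handling kept minimal):

```c
// Compress a buffer into an LZ4 frame with the content checksum enabled, so
// that decompression with LZ4F_decompress() verifies the data's integrity.
#include <lz4frame.h>
#include <stdlib.h>
#include <string.h>

size_t compress_with_checksum(const void* src, size_t src_size, void** dst_out)
{
    LZ4F_preferences_t prefs;
    memset(&prefs, 0, sizeof(prefs));
    prefs.frameInfo.contentChecksumFlag = LZ4F_contentChecksumEnabled;

    size_t bound = LZ4F_compressFrameBound(src_size, &prefs);
    void* dst = malloc(bound);
    if (!dst) {
        return 0;
    }
    size_t n = LZ4F_compressFrame(dst, bound, src, src_size, &prefs);
    if (LZ4F_isError(n)) {
        free(dst);
        return 0;
    }
    *dst_out = dst;
    return n; // compressed size in bytes
}
```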
-
(Sorry if I sound negative. Perhaps there is a relatively easy solution that I just don't see at the moment?)
-
Combining individual files for each result into one sounds like a good start, and easier to do too... I would prefer a different name than "fat files", because I normally associate that with fat objects. And it might be offensive... So better use another name?
-
Could maybe go with the simple format*, and then just do compression on the file level (like before)?

```c
uint32_t suffix_len; // network endian
char suffix[suffix_len];
uint32_t len;        // network endian
char data[len];
```

Assuming that the filename still has the hash. Or maybe add a default suffix to it, to keep DOS happy. `$md4hash-$length`:
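A minimal sketch of appending one entry in that layout (hypothetical helper name, not the actual prototype; lengths are written in network byte order and error handling is omitted):

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Append one suffix_len/suffix/len/data entry to an open result file.
static void write_entry(FILE* f, const char* suffix, const void* data,
                        uint32_t len)
{
    uint32_t suffix_len = (uint32_t)strlen(suffix);
    uint32_t be = htonl(suffix_len);
    fwrite(&be, sizeof(be), 1, f);
    fwrite(suffix, 1, suffix_len, f);
    be = htonl(len);
    fwrite(&be, sizeof(be), 1, f);
    fwrite(data, 1, len, f);
}
```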
-
Brain-storming some potential names: accumulated, aggregated, bunch, bundle, combined, compact, lump, pack (🙂), pile. On the other hand, in the code the concept perhaps could be just "result" and in the filesystem it could be `.result`.
-
Yes, that's in line with my thoughts as well. If we want to keep supporting hardlinking, the header will also need some info saying whether the file data is stored in a separate file or not.
-
Started to work on this feature... There are two additions to start with:
Unlike the memcached format, we don't want to require all the data to be first read into memory.
-
Checksumming and (optional) compression will still be done on the individual file level.
-
Thanks for starting to look into this! I would much appreciate a more detailed explanation of your design and plans when you have time to do so. Should we maybe spin up a
Here are some open questions that I think need to be decided before an implementation can be finalized or maybe even started for real:
I already have opinions on the above of course, but I'm curious to know what you think since you started working on the feature. Comments on what you wrote so far:
Why would the aggregated file need to store paths (to individual result files, I assume)?
Great, yes, it needs to be streamable.
I didn't quite follow you here. Could you expand on when the code would need to seek into the file?
Why do you prefer that instead of compressing the whole file?
-
Basically what I wanted was something that would drop into the current framework... As for the individual files, those are for aggregating the inputs. As in:

```c
add_cache_file(cache, output_obj, ".o");
```

The offsets were collected when preparing to read the files from the aggregated file.

```c
get_cache_file(cache, output_obj, ".o");
```

Compressing individual files is to stay compatible, and to avoid temporary storage.
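To illustrate the offset idea, here is a hedged sketch (hypothetical types and names, assuming the simple per-entry layout sketched earlier and no preceding header): the reader scans the entry headers once, records where each suffix's data starts, and can later seek straight to it.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>

struct entry_offset {
    char suffix[16];  // e.g. ".o", ".stderr"
    long offset;      // file position where the entry's data starts
    uint32_t size;    // length of the data
};

// Scan the aggregated file and record the offset of each entry's data,
// skipping over the data itself so only the headers are read.
static int scan_offsets(FILE* f, struct entry_offset* table, int max_entries)
{
    int n = 0;
    uint32_t be;
    while (n < max_entries && fread(&be, sizeof(be), 1, f) == 1) {
        uint32_t suffix_len = ntohl(be);
        if (suffix_len >= sizeof(table[n].suffix)
            || fread(table[n].suffix, 1, suffix_len, f) != suffix_len
            || fread(&be, sizeof(be), 1, f) != 1) {
            return -1; // malformed or truncated entry
        }
        table[n].suffix[suffix_len] = '\0';
        table[n].size = ntohl(be);
        table[n].offset = ftell(f);
        fseek(f, (long)table[n].size, SEEK_CUR); // jump to the next entry
        n++;
    }
    return n;
}
```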
-
Whoa there, hold your horses... OK, so you're talking about needs that future backends might have. I don't see why we should embed bookkeeping information like that into the result files. In my mind, that should be backend-specific metadata; for a plain filesystem: mtime and size, for (say) Redis: standard LRU eviction configuration, etc. Rationale for not keeping a "recently used timestamp" inside the result object: It will be a performance killer for cleanup. Such information needs to be stored in a more efficient manner.
When you wrote '"simple" 0x31d6cfe0d16ae931b73c59d7e0c089c00000 (network byte order)', I naturally assumed you meant a binary representation because you wrote "0x" and "network byte order". That could make sense for backends that can use any binary string as the key (but you can't use that for filesystem names, of course, since bytes 0x00 and 0x2f aren't allowed). But now it seems you mean going from one human-readable format to another human-readable format... Or not, since you posted PR #399? OK, so by "the same system", you maybe meant "binary where possible in the future, but a new human-readable format for file systems now"? You continue to confuse me... 🙂
I'm not sure if I understand what you mean by this. If you mean that refactoring needs to be done when introducing a generic backend API, yes, sure. If you mean something else, please expand.

I think that there are a bit too many topics in the discussion mix, many of which depend on each other. We need to come up with a way to divide the work into parts that can be discussed relatively independently. Another problem is maybe that I actually already have thought quite a lot about many of the topics in the discussion during the last two years or so, but haven't had time to write down those thoughts and conclusions.

My suggestion is: let's focus on the result file content in this issue. How lookup based on a key (currently hash+size) in the current filesystem backend should work, how a future generic backend API could look, etc., are all very interesting topics, but I don't think they need much discussion when it comes to designing the result file content format. Sounds reasonable?
-
Just meant that it needs to go through the storage backend, so it can decide. It's not really inside the result object, just kept next to it (like you mention). Right now it's in the file stat.
It is "simple" because it is one key rather than two (hash + size), but it is more complex since it is binary - the representation I wrote was made up - just easier to type than
It was not going to be backwards compatible, especially not if changing compression/checksum.
I don't really care about discoverability or file storage, those are already covered by today's system...
It moves keys from the current 42-53 bytes down to 20 bytes, and avoids the dynamic allocation.
Storing the manifest files is easy, we already did that for the memcached backend. But it is also possible to write them directly to memory without going through the file API, should it be needed (later on).
Yeah, I would move the storage backends and such to a different discussion.
Most of my thoughts are in the code, but most of the code is not being merged.
Sure thing. I already posted a code skeleton, which was mostly based on the .manifest format. I can rename those .cache files to .result, and then remove the individual storage/retrieval of files. I was mostly interested in improving the memcached code, and this story is just 1/5 (says #58):
Adding a compression type to this particular story is not needed either; it can continue with gzip.
-
Great! Then it sounds like we're on the same page regarding result file content. Let's discuss the backend API later.
Yes, changing anything that affects the key calculation or key format obviously invalidates pre-existing results stored in the cache. That's nothing strange and has happened many times over the years, for instance when adding new hash material. But what I'm talking about is backward incompatibility when it comes to cleanup. Right now, it works OK enough to run different incompatible versions of ccache against the same cache directory since a) statistics counters are kept in the same place (and with the same format and semantics), and b) the cache directory structure is similar enough. If we change this, then the different ccache versions won't see each other's files and size statistics anymore. We then have to decide if we should live with this likely confusing situation or do something about it. In any case, we need to make a conscious decision. Makes sense?
I don't understand what you mean. Isn't changes to today's system what we are discussing?
I appreciate that you want to make things happen so that a memcached backend can be realized. Sorry if I'm applying resistance, but I strongly feel that we need to do things in a certain way and order for me to be able to keep up with the development and continue maintaining the code base. I think I need to take some time to write down a more concrete plan of what I would like to see. The bullets you refer to in #58 (comment) are mostly handwaving.
-
What I mean is that we already have had that for a while; I want to try something different (i.e. make the cache internals opaque to the user, and then optionally expose them using helpers where needed). The compatibility (with clean and such) is a problem, but maybe it could be "fixed" by moving into a separate directory such as

Updated the sample code, which is just forking the code to see which parts are affected by this. Two of the current optimizations in the source code were affected by the introduced abstraction:
As per above, this leaves the aggregated cached_result files using mtime and unlink like before.
-
Also, the current tweak doesn't address the feature where we have two caches: local + remote. It is just doing the task of combining multiple files into one, and still using gzip compression (only).
-
Passes the "base" tests now, after removing some of the checks on "number of files in cache". Also added a $ ./ccache --dump-result /home/anders/.ccache/4/d/0cdcf7300b6d09c0d8019ef12b9b9c-259.result
Magic: cCcC
File paths (2):
0: .stderr (0.1 kB)
1: .o (1.0 kB) |
Beta Was this translation helpful? Give feedback.
-
@afbjorklund: I have had a quick look at the on-disk format you prototyped and I think it looks good! One thing though is that I think that we should start using 64-bit file sizes.
-
Done in #404. Feel free to comment.
-
Those are some really big object files... They would have to be > 4 GB?
-
Yep. Or source or header files. I don't expect that we will see such large files, but then again, who knows what strange code people will compile in the future? 🙂 Better safe than sorry. The other option is of course to handle the situation gracefully (i.e. call
-
If we're expanding the hash anyway, expanding the size isn't very costly, no.
-
One more thing regarding the on-disk format: I suggest adding a version field to the header. This way the dump function can know if it's a supported format or not.
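As a rough sketch of how such a check could work (hypothetical constants, mirroring the magic shown by `--dump-result` above; not the actual implementation):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define RESULT_MAGIC   "cCcC" // magic shown by --dump-result above
#define RESULT_VERSION 1      // hypothetical current format version

// Read the magic and version from the start of a result file and reject
// anything the dumper doesn't understand.
static int check_header(FILE* f)
{
    char magic[4];
    uint8_t version;
    if (fread(magic, 1, sizeof(magic), f) != sizeof(magic)
        || memcmp(magic, RESULT_MAGIC, sizeof(magic)) != 0) {
        return -1; // not a result file
    }
    if (fread(&version, 1, 1, f) != 1 || version != RESULT_VERSION) {
        return -1; // unsupported version
    }
    return 0;
}
```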
-
OK, will do (similar to the manifest). Also need to fix the file copy, to not do one byte at a time...
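For reference, a minimal sketch of a buffered copy loop (generic C, not the actual patch) that moves data in fixed-size chunks instead of one byte at a time:

```c
#include <stdio.h>

// Copy the remaining contents of `in` to `out` using a fixed-size buffer.
static int copy_stream(FILE* in, FILE* out)
{
    char buf[65536];
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            return -1; // write error
        }
    }
    return ferror(in) ? -1 : 0;
}
```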
-
Will squash the current WIP branch, and make a proper PR for better discussion of details.
-
See #411 for the actual implementation task.
-
Idea: We could introduce "aggregated result files" (other name suggestions welcome) that store all of a cached result in one file instead of in one to seven files as done on master at the time of writing.
This idea has previously been mentioned on the mailing list and in #58 (comment) and #213 (comment).
Currently, a cached result is stored in the cache in the form `$md4hash-$length.o`, `$md4hash-$length.stderr`, `$md4hash-$length.d`, `$md4hash-$length.gcno`, `$md4hash-$length.su`, `$md4hash-$length.dia` and `$md4hash-$length.dwo`. This has a couple of advantages. There are advantages to storing the result in a single file, though:

- Fewer `stat`s need to be made for the cleanup algorithm's LRU behavior, etc.
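To make the idea concrete, here is a hedged sketch of how those per-suffix files could be laid out inside one aggregated result file, combining the magic/version header and per-entry format discussed above (field names and sizes are illustrative only, not a decided format):

```c
// Hypothetical on-disk layout of one aggregated $md4hash-$length.result file:
//
//   char     magic[4];      // e.g. "cCcC"
//   uint8_t  version;       // format version, for dump/fsck-style tooling
//   uint32_t n_entries;     // number of embedded files (network endian)
//   repeated n_entries times:
//     uint32_t suffix_len;            // network endian
//     char     suffix[suffix_len];    // ".o", ".stderr", ".d", ...
//     uint64_t data_len;              // network endian; 64-bit per #404
//     char     data[data_len];        // the embedded file's contents
```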