Fix Issue 21593 - Only update file time if file to be written already exists #12169
Thanks for your pull request, @WalterBright! Bugzilla references: Issue 21593.

Testing this PR locally: if you don't have a local development environment set up, you can use Digger to test this PR:

    dub run digger -- build "master + dmd#12169"
We discussed whether to "touch" the file or not, and decided if it wasn't touched, it would likely break many build systems.
Can you share more about this? I'm not sure how that would break build systems.
    // if (!params.oneobj || modi == 0 || m.isDocFile)
    //     m.deleteObjFile();
Please don't comment out code; remove it.
     * Returns:
     *   `true` on success
     */
    extern (D) static bool update(const(char)* namez, const void[] data)
Suggested change:

    - extern (D) static bool update(const(char)* namez, const void[] data)
    + extern (D) static bool update(const(char)[] name, const void[] data)
The only call to this function passes an array. That array is then converted to a C string and passed here, where strlen is called on it. Just make it accept a const(char)[] and call toCStringThen in here.
I used the same set of overloads as the corresponding write() uses.
In any case, the code is a mess of repeated conversions to/from:
- charz
- string
- wcharz
Refactoring all that is a task for a different PR. I wanted to keep this as targeted as possible to being what the subject says it is.
Consider this Makefile:

    program.o : program.d
    	dmd -c program.d

On first invocation, the compiler compiles program.d and produces program.o. The user then makes an insignificant change to the source file, such as by adding a comment, and re-invokes make. On the second invocation, the compiler recompiles program.d; with this change, the resulting program.o is identical, so its contents are not rewritten. The user then invokes make a third time. Here, if program.o's timestamp had not been refreshed, make would still consider program.o older than program.d and recompile yet again, on every invocation. This will affect all timestamp-based build systems, which is most of them.
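To make the failure mode concrete, here is a small Python sketch (hypothetical, not part of the PR) of a make-style staleness check; the function name and the simulated file ages are illustrative assumptions:

```python
import os
import tempfile
import time

def needs_rebuild(target, source):
    """make-style check: rebuild if the target is missing or older than the source."""
    if not os.path.exists(target):
        return True
    return os.path.getmtime(target) < os.path.getmtime(source)

# Simulate: the source was edited after the target was built.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "program.d")
    obj = os.path.join(d, "program.o")
    open(obj, "w").close()
    open(src, "w").close()
    past = time.time() - 100
    os.utime(obj, (past, past))      # pretend the target was built 100 s ago
    assert needs_rebuild(obj, src)   # source is newer: make would rebuild

    # If the compiler skips both the write AND the touch, obj stays stale forever.
    # Touching the target (as this PR does) restores normal make behavior:
    os.utime(obj, None)              # refresh mtime to "now"
    assert not needs_rebuild(obj, src)
```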
Have you checked if this actually leads to a performance gain (i.e. some benchmark)? Operating systems might already be doing this as an optimization.
That seems hard to tell. I tried deleting a file and then writing an identical file, and the two files would have the same inode number. Depending on how the OS's disk cache works, any loop reading/writing files may never touch the disk at all, until some time has passed and the cache gets flushed to disk. If the disk cache is flushed, I don't see how the OS could recognize that a new write is identical to one on the disk.
Cache flushing happens in the background and doesn't block anything, so, unless the system is IO-bound (unlikely for the scenario of compiling D programs?), it's "free". Maybe D users with very large codebases could try this patch and provide feedback. Otherwise, my recommendation would be to not add something for which the improvement cannot even be measured :)
I'd be curious about its effect when someone is repeatedly building a project that takes 15 minutes or more.
I also know that code which skips a write when the new value equals the old one can yield speedups in many cases.
This should help with SSD wear leveling, making the drive last longer by avoiding write cycles?
No, not in any meaningful way in today's hardware.
    ReadResult r = read(namez);
    if (!r.success ||
        r.buffer.data[] != data[])
        return write(namez, data); // contents not same, so write new file
This is not quite optimal.
- The target file will be opened two times, first for reading and then for writing. Ideally it would be opened only once, for both reading and writing. I doubt this makes a measurable difference in practice save for exceptional cases (slow networked drives).
- Reading the entire file before comparing is indeed inefficient, as the comment notes. However, this is only done if the files have the same size so there's a good likelihood they have the same contents.
For a first shot this should be workable.
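For illustration, a block-by-block comparison along those lines might look like this Python sketch (the function name and block size are assumptions, not the reviewer's code); it stops at the first mismatching block, so a changed file is usually rejected after a single read:

```python
import os

def file_matches(path, data, blocksize=64 * 1024):
    """True if the file at `path` already holds exactly `data`.
    Compares block by block, stopping at the first difference."""
    try:
        if os.path.getsize(path) != len(data):
            return False                      # cheap early-out on size
        with open(path, "rb") as f:
            offset = 0
            while offset < len(data):
                block = f.read(blocksize)
                if not block:                 # file shrank underneath us
                    return False
                if block != data[offset:offset + len(block)]:
                    return False
                offset += len(block)
        return True
    except OSError:                           # missing or unreadable file
        return False
```

Since object files that differ usually also differ in size or near the header, the size early-out plus the first block catches most changes cheaply.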
One alternative would be to simply compute a hash of the object file every time it is written and then use that to compare to newer builds (only if the file sizes match). Computing the hash for the new object file should be faster than reading the contents of the old one, however, compared to the block-by-block method proposed by @andralex it will still perform more writes.
The hash can be stored in a file that the compiler will generate for each working directory.
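As a sketch of that idea (the cache file name, JSON format, and SHA-256 choice are all assumptions for illustration):

```python
import hashlib
import json
import os

CACHE = ".objhashes.json"   # hypothetical per-directory cache file

def unchanged_per_cache(path, data, cache_path=CACHE):
    """True if `data` has the same size and hash as recorded for `path`
    on a previous build; records the new hash either way."""
    digest = hashlib.sha256(data).hexdigest()
    try:
        with open(cache_path) as f:
            cache = json.load(f)
    except (OSError, ValueError):
        cache = {}
    same = (os.path.exists(path)
            and os.path.getsize(path) == len(data)  # only trust the hash on a size match
            and cache.get(path) == digest)
    cache[path] = digest
    with open(cache_path, "w") as f:
        json.dump(cache, f)
    return same
```

Note the size check: a cache entry whose file was modified out of band could otherwise yield a false "unchanged" verdict; a real implementation would also need to handle concurrent builds and cache invalidation.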
    else version (Posix)
    {
        import core.sys.posix.utime;

What's with this newline? This is not Pascal. We don't need empty lines between imports and code.
I simply like to group related things.
Do you have evidence for that? I know the paging system never compares a page for equality with the existing page before flushing. I wonder if SSDs do that comparison. In the general case it seems like a low-yield thing.
Looking at this review for what seems to be an average SSD drive (my machine also has a Crucial): burst read: 924.2 MB/s. This is interesting; I expected the burst read to be faster than the burst write, too. Another article discusses the write process and makes the point in no uncertain terms: "The key to maximize the life span of an SSD is to reduce writing to it."
And with my day-job program writing out like 15 GB of crap when it rebuilds (the final executable itself is like 700 MB), I could see this potentially adding up fast.
No. The file stamp is updated. To the outside, there's no difference. What am I missing?
I also read a few articles (old enough to be difficult to find by now) saying that OS implementations had to be changed to work well on SSDs because they'd write liberally to disk. So this looks like a good thing to do in any case.
SSD manufacturers publish how much data an SSD will typically endure before beginning to degrade. These numbers are generally very high, so high that someone would need to write a lot of data (many GBs per day) for many years in a row to reach them. That you need to be paranoid about SSD writes in typical use is a myth that was true in the very early days of SSD, decades ago, and it has persistently endured despite the dramatic improvements in the technology.
That is a lot of data. But also, it's not really a typical scenario for D users. If you are doing so much I/O during the build, you should be using a RAM disk anyway; you would have much to gain from doing so.
I thought RAM disks were so 1980s, and that modern OS disk caching had obsoleted them.
At the very least, RAM disks are still definitely useful for applications that insist on explicitly syncing their writes to disk. Even without explicit syncing, I believe operating systems' IO schedulers will try to ensure that written data ends up on disk eventually, within a certain time frame (though this is "free" as far as DMD is concerned, as I mentioned above).
@CyberShadow I very much appreciate your skepticism and call for evidence. Evidence, not supposition, is the real deal, as we are empirically-minded, experimental people over here. I mean, we can't have a discussion whereby we trade assertions and dismissals back and forth. I am convinced a lot of respect goes around for the competence of everyone involved in this discussion, but argument by authority is not how it works and not how it should work.

In that spirit I ran a few tests on a desktop equipped with an SSD. I am not equipped, and can't afford the time, to test this very PR in situ against a large project, so I ran a simpler experiment. I ran the following tests on a 6.5 GB Windows 10 ISO file (call it big).

To condition the system, I first cleared all caches:

    echo 3 | sudo tee /proc/sys/vm/drop_caches

Then to test the speed of copying I ran:

    for i in `seq 1 10`; do time cp big $i; done

To test the speed of comparison I ran:

    for i in `seq 1 10`; do time cmp -s big $i; done

The results were interesting.
(Variance is low, lower than what one would typically see in a computational benchmark.) So we're looking at a slight pessimization on a totally cold system (but let's not forget that a totally cold system is the uncommon case), and a robust I/O speed improvement in a common case, something worth taking. More importantly, this paves the way to fixing https://issues.dlang.org/show_bug.cgi?id=21596.
One more experiment, run at Walter's behest:

    time cp big 1
    time cmp -s big 1

i.e. instead of spreading writes across 10 files, I just write to a file and then immediately read from it. The results are closer to what one might encounter in an edit-build cycle, because object files are much smaller than 6.5 GB and therefore likely to be buffered in memory across builds.
Sounds good, though I don't know why you felt the need to say that.
Great, thanks for taking the initiative in running the numbers. There are some errors in your experiment, but I got a similar outcome in my experiment regardless. The big downside is that things are much slower when the object file has changed (compare + write), which may be a likely case when DMD is writing all code to a single object file. So perhaps this should only be enabled when D files are compiled to individual object files. Of course, the ultimate test would be to ask someone who would have a lot to benefit from this patch to actually try it on their codebase. @adamdruppe?
I was replying to @Geod24's question why the timestamp needed to be updated. |
I think in most cases, a change in the object file will come with a change in binary size, won't it?
Possibly not, because of section padding.
I once ran an SSD-killer service (also known as a time-series database). From what I recall, I can give the following scenario:
From that I can do the following back-of-hand maths and say that:
Just to stress, the SSD did not die. It could still be used for other purposes, just not writing metric data at the volume that was required. (It goes without saying that at some point we switched to zRAM to store the data, and remained there until the 512GB NVMe became a thing, at which point there was enough space available that all files could just be preallocated, reducing the death rate of SSDs to around once every 12-18 months). I hope these approximated figures give you a good idea of how little you need to worry about killing your SSD with compiler builds alone. :-)
The only case in which the file is loaded unnecessarily is when the sizes are identical to the byte yet the content is different. I don't have data, but I speculate that's a rare case, so for version 1.0 this should be fine. An improvement is to read the file progressively and stop at the first difference. Since most changes in an object file are likely to change the header, the difference will often be in the first block loaded. That can come in a future PR.
Thanks, good to know. It seems that good citizenry is not a concern, so I'm glad we have the speed advantage. Again, to me this all is a step toward fixing https://issues.dlang.org/show_bug.cgi?id=21596, which is Moby Dick.
On Tue, Feb 02, 2021 at 09:00:12PM -0800, Vladimir Panteleev wrote:
The big downside is that things are much slower when the object file *has* changed (compare + write)
I imagine you can likely catch most of those by a simple file size
comparison. If it is a different size, no need to actually read it,
things have obviously changed.
Odds are in most cases where the object file is the same size, it hasn't
changed and then the read just verifies that assumption.
Of course, the ultimate test would be to ask someone who would have a lot to benefit from this patch to actually try it on their codebase. @adamdruppe ?
Andrei works on the same project so I'm guessing he's already preparing
a test rig :)
Yes, that's the idea. Get it working and prove it, then optimize.
This is a suggestion from Andrei.
Because file writes are so much slower than reads, and because of weak dependency management in build systems, the compiler will often generate the same files again and again. This changes file writes to first see if the file already exists and is identical. If so, it just updates the last written time on that file instead of writing a new one.
We discussed whether to "touch" the file or not, and decided if it wasn't touched, it would likely break many build systems.
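The behavior described above (compare first, touch if identical, otherwise write) can be sketched in Python; this is an illustration of the idea, not the actual D code:

```python
import os

def update(path, data):
    """Write `data` to `path`, but if the file already exists with
    identical contents, only refresh its timestamp. Returns True if
    the contents were already up to date."""
    try:
        with open(path, "rb") as f:
            existing = f.read()
    except OSError:          # file missing or unreadable: fall through to write
        existing = None
    if existing == data:
        # Same contents: skip the write, but still "touch" the file so
        # timestamp-based build systems like make see it as fresh.
        os.utime(path, None)
        return True
    with open(path, "wb") as f:
        f.write(data)
    return False
```

The actual patch also avoids reading the old file when the sizes already differ, and uses utime on Posix to refresh the timestamp.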