Fix Issue 21593 - Only update file time if file to be written already exists #12169

Merged
merged 1 commit into dlang:master on Feb 3, 2021

Conversation

@WalterBright (Member)

This is a suggestion from Andrei.

Because file writes are so much slower than reads, and because of weak dependency management in build systems, the compiler will often generate the same files again and again. This changes file writes to first see if the file already exists and is identical. If so, it just updates the last written time on that file instead of writing a new one.

We discussed whether to "touch" the file or not, and decided if it wasn't touched, it would likely break many build systems.
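
As an illustration, a minimal sketch of the idea in D follows. The updateFile helper and the use of Phobos std.file are hypothetical; this is not the dmd implementation.

// Sketch only: a hypothetical helper illustrating the approach, not dmd's code.
import std.file : exists, read, setTimes, write;
import std.datetime.systime : Clock;

// Write `data` to `name`, but if the file already exists with identical
// contents, only bump its timestamps so build systems still see a fresh output.
void updateFile(string name, const(void)[] data)
{
    if (name.exists && cast(const(ubyte)[]) name.read() == cast(const(ubyte)[]) data)
    {
        auto now = Clock.currTime();
        name.setTimes(now, now); // "touch": update access and modification times
        return;
    }
    name.write(data); // contents differ or file missing: write it out
}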

@dlang-bot (Contributor)

dlang-bot commented Jan 30, 2021

Thanks for your pull request, @WalterBright!

Bugzilla references

Auto-close | Bugzilla | Severity    | Description
           | 21593    | enhancement | Only update file time if file to be written already exists

Testing this PR locally

If you don't have a local development environment set up, you can use Digger to test this PR:

dub run digger -- build "master + dmd#12169"

@WalterBright changed the title from "add File.update()" to "Fix Issue 21593 - Only update file time if file to be written already exists" on Jan 30, 2021
@Geod24 (Member) left a comment

We discussed whether to "touch" the file or not, and decided if it wasn't touched, it would likely break many build systems.

Can you share more about this? I'm not sure how that would break build systems.

Comment on lines +489 to +490
// if (!params.oneobj || modi == 0 || m.isDocFile)
// m.deleteObjFile();
Member

Don't comment out code please, remove it

* Returns:
* `true` on success
*/
extern (D) static bool update(const(char)* namez, const void[] data)
Member

Suggested change
-extern (D) static bool update(const(char)* namez, const void[] data)
+extern (D) static bool update(const(char)[] name, const void[] data)

The only call to this function passes an array. That array is then converted to a C string and passed here, where strlen is called on it. Just make it accept a const(char)[] and call toCStringThen in here.
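
A rough sketch of the suggested shape, assuming toCStringThen from dmd.root.string runs a delegate with a temporary null-terminated copy of the slice. It is shown here as a thin forwarding overload for illustration, whereas the suggestion is to push the conversion all the way down to the OS call.

// Sketch only: slice-based overload forwarding to the existing char* one.
import dmd.root.string : toCStringThen;

extern (D) static bool update(const(char)[] name, const void[] data)
{
    // Build the C string only at the boundary that actually needs it.
    return name.toCStringThen!(namez => update(namez.ptr, data));
}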

@WalterBright (Member Author)

I used the same set of overloads as the corresponding write() uses.

In any case, the code is a mess of repeated conversions to/from:

  1. charz
  2. string
  3. wcharz

Refactoring all that is a task for a different PR. I wanted to keep this as targeted as possible to being what the subject says it is.

@CyberShadow (Member)

Can you share more about this? I'm not sure how that would break build systems.

Consider this Makefile:

program.o : program.d
	dmd -c program.d

On first invocation, the compiler compiles program.d and creates program.o. So far so good.

The user then makes an insignificant change to the source file, such as by adding a comment, and re-invokes make.

On the second invocation, the compiler recompiles program.d, but notices that program.o is identical to what it would have written, and doesn't touch the file. Still, so far so good.

The user then invokes make for a third time without making any changes to any source files.

Here, make sees that program.d is newer than program.o and re-invokes the compiler, even though program.o already represents the latest copy of program.d. This is where the problems start.

This will affect all timestamp-based build systems, which is most of them.

@CyberShadow (Member)

Because file writes are so much slower than reads, and because of weak dependency management in build systems, the compiler will often generate the same files again and again. This changes file writes to first see if the file already exists and is identical. If so, it just updates the last written time on that file instead of writing a new one.

Have you checked if this actually leads to a performance gain (i.e. some benchmark)? Operating systems might already be doing this as an optimization.

@WalterBright (Member Author)

Operating systems might already be doing this as an optimization.

That seems hard to tell. I tried deleting files and then writing an identical file, and the two files would have the same inode number. Depending on how the OS's disk cache works, any loop reading/writing files may never touch the disk at all, until some time has passed and the cache gets flushed to disk.

If the disk cache is flushed, I don't see how the OS could recognize that a new write is identical to one on the disk.

@CyberShadow (Member)

Cache flushing happens in the background and doesn't block anything, so, unless the system is IO-bound (unlikely for the scenario of compiling D programs?), it's "free".

Maybe D users with very large codebases could try this patch and provide feedback. Otherwise, my recommendation would be to not add something for which the improvement cannot even be measured :)

@WalterBright (Member Author)

I'd be curious about its effect when someone is repeatedly building a project that takes 15 minutes or more.

@WalterBright (Member Author)

I also know that writing code like:

if (a[i] != x) a[i] = x;

can yield speedups in many cases.
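
For instance, a hypothetical micro-benchmark of that conditional-store pattern could look like the following; results will vary by hardware, and this only shows how one might measure it.

// Sketch of a micro-benchmark comparing unconditional and conditional stores.
import std.datetime.stopwatch : benchmark;
import std.stdio : writeln;

void main()
{
    enum N = 1_000_000;
    auto a = new int[N]; // already holds the target values (all zero)

    void unconditional() { foreach (i; 0 .. N) a[i] = 0; }
    void conditional()   { foreach (i; 0 .. N) if (a[i] != 0) a[i] = 0; }

    auto times = benchmark!(unconditional, conditional)(100);
    writeln("unconditional: ", times[0]);
    writeln("conditional:   ", times[1]);
}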

@apz28

apz28 commented Jan 30, 2021

This should help SSD wear leveling -> the drive lasts longer by avoiding write cycles?

@CyberShadow (Member)

This should help SSD wear leveling -> the drive lasts longer by avoiding write cycles?

No, not in any meaningful way in today's hardware.

ReadResult r = read(namez);
if (!r.success ||
    r.buffer.data[] != data[])
    return write(namez, data); // contents not same, so write new file
Member

This is not quite optimal.

  • The target file will be opened two times, first for reading and then for writing. Ideally it would be opened only once, for reading AND writing. I doubt this makes a measurable difference in practice save for exceptional cases (slow networked drives).
  • Reading the entire file before comparing is indeed inefficient, as the comment notes. However, this is only done if the files have the same size, so there's a good likelihood they have the same contents.

For a first shot this should be workable.

Contributor

One alternative would be to simply compute a hash of the object file every time it is written and then use that to compare to newer builds (only if the file sizes match). Computing the hash for the new object file should be faster than reading the contents of the old one; however, compared to the block-by-block method proposed by @andralex, it will still perform more writes.

The hash can be stored in a file that the compiler will generate for each working directory.
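
A rough sketch of that alternative in D, using Phobos' SHA-256 as a stand-in hash and a hypothetical per-file ".sha256" sidecar rather than the per-directory file described above:

// Sketch only: hash-based skip check; file layout and names are hypothetical.
import std.digest.sha : sha256Of;
import std.file : exists, read, write;

// Returns true if `name` can be skipped because the stored hash of its
// previous contents matches the hash of the new `data`.
bool sameAsLastBuild(string name, const(void)[] data)
{
    const hashFile = name ~ ".sha256"; // hypothetical sidecar file
    const newHash = sha256Of(cast(const(ubyte)[]) data);
    if (hashFile.exists && cast(const(ubyte)[]) hashFile.read() == newHash[])
        return true;
    hashFile.write(newHash[]); // remember the hash for the next build
    return false;
}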

else version (Posix)
{
    import core.sys.posix.utime;

Member

What's with this newline? This is not Pascal. We don't need empty lines between imports and code.

@WalterBright (Member Author)

I simply like to group related things.

@andralex (Member)

This should help SSD disk drive wear level -> last longer by avoiding write cycle?

No, not in any meaningful way in today's hardware.

Do you have evidence for that? I know the paging system never compares a page for equality with the existing page before flushing. I wonder if SSDs do that comparison. In the general case it seems like a low yield thing.

@andralex (Member)

andralex commented Jan 30, 2021

Looking at this review for what seems to be an average SSD drive (my machine also has a Crucial):

Burst read: 924.2 MB/s
Sustained read: 1170.8 MB/s
Burst write: 1605.9 MB/s
Sustained write: 300.3 MB/s

This is interesting; I expected the burst read to be faster than burst write, too.

Another article discusses the write process and makes the point in no uncertain terms: "The key to maximize the life span of an SSD is to reduce writing to it."

@adamdruppe (Contributor)

And with the day-job program writing out like 15 GB of crap when it rebuilds (the final executable itself is like 700 MB), I could see this potentially adding up fast.

@andralex (Member)

This will affect all timestamp-based build systems, which is most of them.

No. The file stamp is updated. To the outside, there's no difference. What am I missing?

@andralex (Member)

I also read a few articles (old enough to be difficult to find by now) saying that OS implementations had to be changed to work well on SSDs because they wrote to disk liberally. So this looks like a good thing to do in any case.

@CyberShadow (Member)

Do you have evidence for that?

SSD manufacturers publish how much data an SSD will typically endure before beginning to degrade. These numbers are generally very high, so high that someone would need to write a lot of data (many GBs per day) for many years in a row to reach them.

That you need to be paranoid about SSD writes in typical use is a myth that was true in the very early days of SSD, decades ago, and it has persistently endured despite the dramatic improvements in the technology.

And with the day-job program writing out like 15 GB of crap when it rebuilds (the final executable itself is like 700 MB), I could see this potentially adding up fast.

That is a lot of data. But also, it's not really a typical scenario for D users. If you are doing so much I/O during the build, you should be using a RAM disk anyway; you would have a lot to gain from doing so.

@WalterBright (Member Author)

you should be using a RAM disk anyway

I thought RAM disks were so 1980s, and that modern OS disk caching had made them obsolete.

@CyberShadow (Member)

CyberShadow commented Jan 31, 2021

At the very least, RAM disks are still definitely useful for applications that insist on using fsync / fdatasync / O_SYNC. (That, or hacks like libeatmydata.) Some applications do this inadvertently through a library, for example SQLite uses fsync by default to satisfy its ACID guarantees, and lots of applications use SQLite under the hood.

Even without explicit syncing, I believe operating systems' IO schedulers will try to ensure that written data ends up on disk within a certain time frame (though this is "free" as far as DMD is concerned, as I mentioned above).

@andralex (Member)

andralex commented Feb 2, 2021

@CyberShadow I very much appreciate your skepticism and call for evidence. Evidence, not supposition, is the real deal, as we are empirically-minded, experimental people over here. I mean, we can't have a discussion whereby we trade assertions and dismissals back and forth. I am convinced a lot of respect goes around for the competence of everyone involved in this discussion, but argument by authority is not how it works and not how it should work.

In that spirit I ran a few tests on a desktop equipped with an SSD. I am not equipped and can't afford the time to test this very PR in-situ against a large project, so I used cp and cmp commands as proxies to assess file copying speed vs. file reading and comparison speed. Note that cmp does more work than dmd because it reads two files, whereas dmd already has an in-memory buffer and only needs to read one file.

I ran the following tests on a 6.5 GB Windows 10 ISO file (call it big) on a Unix desktop with an SSD drive (old, but top of the line a few years ago).

To condition the system, I first cleared all caches:

echo 3 | sudo tee /proc/sys/vm/drop_caches

Then to test speed of copying I ran:

for i in `seq 1 10`; do time cp big $i; done

To test the speed of comparison I ran:

for i in `seq 1 10`; do time cmp -s big $i; done

The results were interesting.

  • First (cold) cp: 31s
  • First (cold) cmp: 33s
  • Subsequent (warm) cp: 14.9s
  • Subsequent (warm) cmp: 11.4s

(Variance is low - lower than what one would typically see in a computational benchmark.)

So we're looking at a slight pessimization on a totally cold system (but let's not forget cmp loads twice as much data compared to dmd), and a 23.5% improvement on a warm system. It is often the case that the system is warm during repeated edit-build cycles, because the object file produced in one build is likely to be cached (subject to available RAM) during the next.

So we're looking at a robust I/O speed improvement in a common case, something worth taking. More importantly, this paves the way to fixing https://issues.dlang.org/show_bug.cgi?id=21596.

@andralex (Member)

andralex commented Feb 2, 2021

One more experiment, run at Walter's behest:

time cp big 1
time cmp -s big 1

I.e., instead of spreading writes across 10 files, I just write to a file and then immediately read from it. The results are closer to what one might encounter in an edit-build cycle, because object files are much smaller than 6.5 GB and therefore likely to be buffered in memory across builds.

The cp time is the same (31s cold, 14.9s warm), but the cmp time drops dramatically to under 3s. So we're looking at a very significant improvement in this scenario. Not sure how it maps to various workflows, but I thought I'd share for completeness.

@CyberShadow (Member)

@CyberShadow I very much appreciate your skepticism and call for evidence. Evidence, not supposition, is the real deal, as we are empirically-minded, experimental people over here. I mean, we can't have a discussion whereby we trade assertions and dismissals back and forth. I am convinced a lot of respect goes around for the competence of everyone involved in this discussion, but argument by authority is not how it works and not how it should work.

Sounds good, though I don't know why you felt the need to say that.

So we're looking at a slight pessimization on a totally cold system (but let's not forget cmp loads twice as much data compared to dmd), and a 23.5% improvement on a warm system.

Great. Thanks for taking initiative in running the numbers. There are some errors in your experiment, but I got a similar outcome in my experiment regardless.

The big downside is that things are much slower when the object file has changed (compare + write), which may be a likely case when DMD is writing all code to a single object file. So, perhaps this should only be enabled when D files are compiled to individual object files.

Of course, the ultimate test would be to ask someone who would have a lot to benefit from this patch to actually try it on their codebase. @adamdruppe ?

@CyberShadow (Member)

This will affect all timestamp-based build systems, which is most of them.

No. The file stamp is updated. To the outside, there's no difference. What am I missing?

I was replying to @Geod24's question why the timestamp needed to be updated.

@Geod24 (Member)

Geod24 commented Feb 3, 2021

The big downside is that things are much slower when the object file has changed (compare + write)

I think in most cases a change in the object file will come with a change in binary size, won't it?

@CyberShadow (Member)

Possibly not, because of section padding.

@ibuclaw (Member)

ibuclaw commented Feb 3, 2021

Looking at this review for what seems to be an average SSD drive (my machine also has a Crucial):

I once ran an SSD-killer service (also known as a time-series database). From what I recall, I can give the following scenario:

  • ~2 million files
  • Appended to every 30 seconds
  • Each new write added 8 bytes to the file size (or a multiple of 8 bytes if it took > 1min to persist the in-memory cache to disk)
  • Each file kept 3-4 days worth of data before being truncated (old data "rolled up" into a lossy data file)
  • Because the size of the metric data was not preallocated, each write would force a new inode.
  • 2x (250GB) SSDs in RAID0 configuration (sometimes Intel, sometimes Samsung, always server-grade)
  • After around 6 months (let's say 180 days to keep things simple), at least one disk would degrade to the point that it could no longer keep up with the writes.

From that I can do the following back-of-the-envelope maths and say that:

  • Each metric file had 2880 writes a day (plus 1 to truncate it, plus 3 for the re-aggregating of lossy archives)
  • The size of each file fluctuated between 67.5 KB and 90 KB (each individual write was between these sizes)
  • In total, there were ~5.768 billion writes daily (~66,666 per second)
  • ...amounting to ~422.44 TB of data (~5 GB per second, though only ~16 MB of that was new data per second)
  • Degradation of a disk occurred after ~1038.24 billion writes (over 1 trillion)
  • ...and ~74.2 PB of data.

Just to stress, the SSD did not die. It could still be used for other purposes, just not writing metric data at the volume that was required.

(It goes without saying that at some point we switched to zRAM to store the data, and remained there until 512 GB NVMe drives became a thing, at which point there was enough space available that all files could just be preallocated, reducing the death rate of SSDs to around once every 12-18 months).

I hope these approximated figures give you a good idea of how little you need to worry about killing your SSD with compiler builds alone. :-)

@andralex (Member)

andralex commented Feb 3, 2021

The big downside is that things are much slower when the object file has changed (compare + write)

The only case in which the file is loaded unnecessarily is when the sizes are identical to the byte yet the content is different. I don't have data, but I speculate that's a rare case, so for version 1.0 this should be fine.

An improvement is to read the file progressively and stop at the first difference. Since most changes in an object file are likely to change the header, the difference will often be in the first block loaded. That can come in a future PR.
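
A minimal sketch of that future improvement, comparing in fixed-size blocks and bailing out at the first mismatch (C stdio and the 64 KB block size are illustrative choices, not what this PR does):

// Sketch only: block-by-block comparison that stops at the first difference.
import core.stdc.stdio : FILE, fclose, fopen, fread;

// Returns true if the file `namez` already holds exactly `data`.
bool fileMatches(const(char)* namez, const(void)[] data)
{
    FILE* fp = fopen(namez, "rb");
    if (!fp)
        return false;
    scope (exit) fclose(fp);

    auto rest = cast(const(ubyte)[]) data;
    ubyte[64 * 1024] buf = void; // 64 KB blocks, arbitrary choice
    while (rest.length)
    {
        const n = fread(buf.ptr, 1, buf.length, fp);
        if (n == 0 || n > rest.length || buf[0 .. n] != rest[0 .. n])
            return false; // short read or first difference found
        rest = rest[n .. $];
    }
    // The file must also end here, otherwise it is longer than `data`.
    return fread(buf.ptr, 1, 1, fp) == 0;
}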

@andralex (Member)

andralex commented Feb 3, 2021

I hope these approximated figures give you a good idea of how little you need to worry about killing your SSD with compiler builds alone. :-)

Thanks, good to know. It seems that good citizenship is not a concern, so I'm glad we have the speed advantage. Again, to me all of this is a step toward fixing https://issues.dlang.org/show_bug.cgi?id=21596, which is Moby Dick.

@adamdruppe (Contributor)

adamdruppe commented Feb 3, 2021 via email

@andralex merged commit 97399cf into dlang:master on Feb 3, 2021
@WalterBright (Member Author)

An improvement is to read the file progressively and stop at the first difference. Since most changes in an object file are likely to change the header, the difference will often be in the first block loaded. That can come in a future PR.

Yes, that's the idea. Get it working and prove it, then optimize.

@WalterBright deleted the fileUpdate branch on February 5, 2021 at 09:48