SafeFileOutputStream: avoid touching files #1220 #1246
Conversation
SafeFileOutputStream is used to write workspace metadata on Eclipse exit. Previously it touched the existing files even when the content stayed the same. With this change the content is prepared in a byte array and flushed to disk if and only if the content changed. eclipse-platform#1220
I want to mention CachingOutputStream here, which is quite smart: it uses a read/write buffer that skips all bytes that are the same, so there is no need for a temp file and no need to read it back and compare bytes (which might be an expensive operation).
byte[] oldContent = Files.readAllBytes(target.toPath());
byte[] newContent = output.toByteArray();
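Schematically, the write-only-if-changed idea behind the PR looks roughly like this (a minimal self-contained sketch, not the actual SafeFileOutputStream code; the class and helper names are hypothetical):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class WriteIfChanged {
    /** Writes newContent to target only if it differs from the current file content. */
    static boolean writeIfChanged(Path target, byte[] newContent) throws IOException {
        if (Files.exists(target)) {
            byte[] oldContent = Files.readAllBytes(target);
            if (Arrays.equals(oldContent, newContent)) {
                return false; // content unchanged -> do not touch the file
            }
        }
        Files.write(target, newContent);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        byte[] data = "hello".getBytes();
        System.out.println(writeIfChanged(tmp, data)); // first write: content differs
        System.out.println(writeIfChanged(tmp, data)); // unchanged: file is not touched
        Files.delete(tmp);
    }
}
```

The trade-off discussed below is that this holds both the old and the new content in heap memory at once.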
This loads twice the size of the file into memory. For small files that's fine, but if the file has some risk of reaching several MB, it can then trigger GC and so on, which in the end would also be annoying.
Would it be possible to compare byte-by-byte, or to use some utility that compares streams without loading them fully into memory?
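For illustration, a comparison that streams both inputs instead of loading them fully into memory could look like this (a minimal sketch; the class and method names are hypothetical):

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamCompare {
    /** Compares two streams byte-by-byte without loading either fully into memory. */
    static boolean contentEquals(InputStream a, InputStream b) throws IOException {
        try (InputStream in1 = new BufferedInputStream(a);
             InputStream in2 = new BufferedInputStream(b)) {
            int x;
            while ((x = in1.read()) != -1) {
                if (x != in2.read()) {
                    return false; // first differing byte found
                }
            }
            return in2.read() == -1; // both streams must end together
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] d1 = {1, 2, 3};
        byte[] d2 = {1, 2, 4};
        System.out.println(contentEquals(
                new ByteArrayInputStream(d1), new ByteArrayInputStream(d1.clone())));
        System.out.println(contentEquals(
                new ByteArrayInputStream(d1), new ByteArrayInputStream(d2)));
    }
}
```

The BufferedInputStream wrappers keep the per-byte reads cheap; only the buffers live on the heap, not the whole files.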
See the related issue: even all files together are <12 MB on a big workspace. There is no need for memory optimizations.
@mickaelistria this is a classical task for a RandomAccessFile, which can be opened as R/W: you read the bytes from the one stream, compare them to the ones read by the RandomAccessFile, and then either skip over them if they are the same or write the new bytes. This rewrites only the bytes that changed. One must only take into account that RandomAccessFile is quite bare, so reading buffered (e.g. with a 4096-byte buffer) will improve performance.
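A minimal sketch of that read/compare/write idea with a RandomAccessFile and a 4096-byte buffer might look like this (hypothetical names, not code from the PR):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class PatchWrite {
    /** Rewrites only the regions of target that differ from newContent. */
    static void patchWrite(Path target, byte[] newContent) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(target.toFile(), "rw")) {
            byte[] buf = new byte[4096];
            int pos = 0;
            while (pos < newContent.length) {
                int want = Math.min(buf.length, newContent.length - pos);
                raf.seek(pos);
                int got = raf.read(buf, 0, want);
                boolean same = got == want
                        && Arrays.equals(buf, 0, want, newContent, pos, pos + want);
                if (!same) {
                    raf.seek(pos);                    // rewind to the start of the chunk
                    raf.write(newContent, pos, want); // rewrite only this differing chunk
                }
                pos += want;
            }
            raf.setLength(newContent.length);         // drop any trailing old bytes
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("patch", ".bin");
        Files.write(tmp, "hello world".getBytes());
        patchWrite(tmp, "hello earth".getBytes());
        System.out.println(new String(Files.readAllBytes(tmp)));
        Files.delete(tmp);
    }
}
```

Unchanged chunks are only read, never rewritten, so the file's modification footprint is limited to the bytes that actually changed.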
@jukzi if you are hunting for seconds, a read/compare/write can save you even more, since you need to read the old state anyway; combining the compare with the write is more performant, and RandomAccessFile can even be memory-mapped if you like ;-)
See the related issue: even all files together are <12 MB on a big workspace
Please define "big". We've seen .markers files with ~200 MB. Place them on NFS and have fun :).
And on that workspace 200 MB won't bother you either.
And on that workspace 200 MB won't bother you either.
I'm not sure what you mean. If I tell you that it is slow to read 200 MB and you say I shouldn't bother, then what kind of argument is needed to explain that an extra read of 200 MB from NFS may double the time needed to write workspace data on a slow file system?
BTW, you are trying to optimize a 1-second case and ignore a possible regression of minutes.
The memory size complained about was Java heap memory. If you are talking about file transfer time, you might be surprised that this PR can totally eliminate the network transfer: the file content is normally cached by the OS anyway, and in that case this PR can prevent any read/write via the network if the content stays the same (which is the normal case).
can totally eliminate the network transfer, as the file content is normally cached by OS anyway and
"Can" and "normally" indicate that you are not sure this will be the case, and of course no one can guarantee that.
What is, however, guaranteed is that there will be an OS that does not have caching enabled, or the cache will be too small, or a different virus scanner will be running than the one you've tested, and it will prevent the OS from caching the data because it will want to read/check every accessed file.
We are talking about N different operating systems with M versions and X different virus scanners with Y versions, and that is only if we have the data locally (client side); of course we might also have an NFS server side with a similar combination of various components involved.
With that, I'm pretty sure that this PR will introduce a regression and therefore I'm against merging this PR "as is".
Please don’t do an extra read operation.
Calling for premature optimization? Those bulk reads/writes are highly optimized in current JDKs; any streaming may be slower.
It is irrelevant how fast it is implemented in the JDK if the underlying NFS layer can only deliver 1 MB/s.
Then could you please test that you are not accidentally blocking a PR that improves that performance.
I've always assumed the contributor who provides a change has to prove that the change does not harm. I've pointed out a use case here (big workspaces, many changes, lots of data in the .markers file) and also pointed to the possible regression. I have no time to properly validate every PR.

Now I would expect that a contributor who wants to merge this change would try to reproduce and analyse the potentially bad performance for the use case I've provided. If you want to merge or test it in the SDK without further analysis, I would prefer you change the PR, add a new system property, and keep the behavior "as is" if the property is not set, so this change would not break others by default.

But just merging it based on an observation with small data on Windows, and optimizing it for the "virus scanner" use case while ignoring everyone who doesn't work on small workspaces or on Windows, is not what I would expect.
Just as a little proof of concept of how one can do a read/compare/write operation (using the local file system):
Just read the whole file at once: ~200 ms faster startup for the platform workspace.
@iloveeclipse
I also don't like the idea that you are fine with introducing regressions, not considering that the change may cause trouble for users with a different setup. That is not nice and not inclusive, and shouldn't be handled this way in the Eclipse platform. The change, if integrated in a custom application that only runs locally on Windows, would be fine. But we are not in a custom application that only runs on a local file system on Windows. You don't know whether there are millions of users working with NFS; this doesn't seem to be a very uncommon case to me.
Force-pushed: to avoid read on save when content hash did not change (~50 ms slower). Compare 441cb87 to 9f3c426.
I find the tone of the communication here to be disturbing and kind of depressing. 😢 We live in a glass house. We ought to strive to remain constructive and open and, even in the face of frustration and disagreement, give others the benefit of the doubt that they too have the best interests of the project and the community in mind. Please try to be nice.
I must confess that I fail to understand why a hash computation (and comparing hashes) is better than comparing the bytes in place. If we can compute the hash, it means we have read both (new and old) at least once, so at this "reading phase" one could trivially compare the bytes.
Reading happens when you start Eclipse; writing happens when you exit Eclipse, and there can be a long time in between. Keeping only the hashes in memory is very heap-friendly compared to storing the whole file content.
Okay, I had missed that. But then why not use a safe hash, e.g. at least SHA-1? The usual hash code is only meant to be evenly distributed at best, and equals performs the actual check. Regardless of the algorithm used, one would usually use DigestInput/OutputStreams to compute the hash on the fly instead of after the fact.
Even computing a simple hash is much slower (50 ms overhead in my benchmark) than reading all those bytes from the OS cache (which is just a mem copy). There is no need for a slow cryptographic hash that tries to ensure uniqueness.
I just wanted to mention that I would simply use a compound key.
In the already very rare case of a collision, the content is compared bytewise anyway to be sure. Nothing unsafe.
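Schematically, the compound key (file length plus a cheap non-cryptographic hash) remembered at read time and checked at write time could look like this (a hypothetical sketch; the real implementation may differ):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ContentCache {
    /** Compound key: file length plus a cheap, non-cryptographic content hash. */
    record Key(long length, int hash) {}

    private final Map<String, Key> cache = new HashMap<>();

    static Key keyOf(byte[] content) {
        return new Key(content.length, Arrays.hashCode(content));
    }

    /** Remember the content seen at startup; only the small key stays on the heap. */
    void remember(String path, byte[] content) {
        cache.put(path, keyOf(content));
    }

    /**
     * On save: true if the content may have changed (key mismatch -> write needed).
     * A matching key could in rare cases be a collision, so a real implementation
     * would fall back to a bytewise compare before skipping the write.
     */
    boolean maybeChanged(String path, byte[] newContent) {
        return !keyOf(newContent).equals(cache.get(path));
    }

    public static void main(String[] args) {
        ContentCache c = new ContentCache();
        byte[] v1 = "metadata".getBytes(StandardCharsets.UTF_8);
        c.remember("/ws/.metadata/file", v1);  // hypothetical path, for illustration
        System.out.println(c.maybeChanged("/ws/.metadata/file", v1));
        System.out.println(c.maybeChanged("/ws/.metadata/file",
                "changed".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Only the small (length, hash) pairs survive between startup and shutdown, which is what makes this heap-friendly compared to caching whole file contents.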
I'm a bit irritated that this was merged without an approving review, as at least two committers have expressed concerns (with even one -1 review), and I don't see the concerns addressed. Also, that we now have 3 intermediate commits makes this quite hard to compare. In particular, I don't see the concern of @iloveeclipse addressed here in any way. To explain this, let me compare the previous state with the current one regarding memory:

For the reading case, previously only streams with a specific buffer size were used, so memory for reading such a file was at a fixed limit. Now there is no limit, and the JVM needs to allocate an array of the size of the file (for the case mentioned by @iloveeclipse, 200 MB). Additionally, for these 200 MB we now need to compute the hash, which adds a runtime overhead (even though we hope it will outweigh the time needed to write an unmodified file, but see below).

For the writing case (let's assume the file didn't change, as otherwise this change won't make any sense at all), again previously only streams with a fixed-size buffer were used; now it needs twice the size of the file (in @iloveeclipse's case, 400 MB) and an additional hash computation of the 200 MB file. So we have a double hash-computation penalty (once on read, once on write) and double memory consumption.

I'll try to mitigate some of these things, but this has definitely not addressed the concerns about big files and simply ignored all hints to make it more performant. And whether it improves performance at all seems to depend on whether the double hashing is faster than the file write (which might depend on the OS and the hard drives / SSDs / ... used).
I too must express my dismay. I had planned to review this in detail to understand better what everyone is trying to say from a technical perspective. Unfortunately this is SimRel release week, so for me it's kind of the week from hell, with yesterday and today fully loaded. I had expected to have more time for a review, and I did not expect committing the change as-is to be an urgent need.

As I see it, something drastically needs to change in the communication and the attitudes here. The behavior I see is simply unacceptable; it is beneath even the lowest standard of what I expect from the highly skilled professionals involved here. I see frustration boiling over into hostility, and then vented in open expressions of hostility directed in a very personal way. Things escalate at this emotional level, leaving all the technical issues in the dust and forgotten. The end result of this poor process is a sub-optimal commit.

When children play like this, we call for a time-out and we separate them. We try to avoid taking sides, because of course all problems are someone else's fault. But, as they say, it takes two to tango.
@jukzi for more controversial changes like this one (which have very early triggered some discussion) which are not blockers, please be patient and leave more time to review before merging. Leaving more time for discussion will either consolidate a consensus or reveal some important issues; both are a win from a project perspective. |
@eclipse-platform/eclipse-platform-project-leads - time to step up |
@jukzi : please don't do this anymore without giving reviewers a chance to review the PR again. Additional concerns on the submitted PR:
Regarding the PR merge:
I was asked to act as a project lead.
I am concerned about the obvious lies. You did have the mandated time to re-review; you were asked to do so. You just did not. You did have the time to test; you even announced you would not. The read of the backup file on startup still exists and has not been changed.
Jörg, let us discuss technical issues and not descend to this style of communication. This is not OK. I think we all want the best for Eclipse, but we need to communicate in a way that doesn't offend anyone, and if this communication doesn't work, we should try to understand what prevents us from being successful. Please give reviewers a chance to discuss the change; we could avoid misunderstandings about what the change is supposed to do. That's what the review is for: we all need to understand the change, and if something is unclear or could be done differently, discuss it. If concerns were expressed, one should not merge the PR anyway.
Based on the concerns raised, I suggest merging this PR behind an option, so that interested users can enable and explore it; after we have confidence, we can make it the default in the next release.
Technical concerns can be valid or invalid, solved or unsolved. Proven, unproven, tested or untested, important or unimportant. Helpful or ill-minded. Don't mix them all together.
@jukzi : let's restart the technical discussion on a new PR.