
SafeFileOutputStream: avoid touching files #1220 #1246

Merged
merged 3 commits into eclipse-platform:master from SafeFileOutputStream_notouch on Mar 12, 2024

Conversation

jukzi
Contributor

@jukzi jukzi commented Mar 11, 2024

SafeFileOutputStream is used to write workspace metadata on Eclipse exit. Previously it touched the existing files even when the content stayed the same. With this change the content is prepared in a byte array and flushed to disk if and only if the content changed.

#1220

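For illustration only, a minimal sketch of the approach described above (hypothetical helper names, not the actual SafeFileOutputStream code):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class WriteIfChanged {
	// Hypothetical sketch: buffer the new content in memory and only touch the
	// target file when the bytes actually differ from what is already on disk.
	static void writeIfChanged(Path target, byte[] newContent) throws IOException {
		if (Files.exists(target) && Arrays.equals(Files.readAllBytes(target), newContent)) {
			return; // identical content: leave the file (and its timestamp) untouched
		}
		Files.write(target, newContent);
	}
}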
@jukzi
Contributor Author

jukzi commented Mar 11, 2024

On Windows with the platform workspace this PR reduces Eclipse start+stop by >1 sec for both save and restore:
before: [screenshot of timings]
after: [screenshot of timings]

@laeubi
Contributor

laeubi commented Mar 11, 2024

I want to mention CachingOutputStream here, which is quite smart: it uses a read/write buffer that skips all bytes that are the same, so there is no need for a temp file and no need to read the file back and compare bytes (which might be an expensive operation).

Comment on lines 74 to 75
byte[] oldContent = Files.readAllBytes(target.toPath());
byte[] newContent = output.toByteArray();
Contributor

This loads 2× the size of the file into memory. For small files it's fine, but if the file has some risk of reaching several MB, it can be the cause of GC triggering and so on, which in the end would also be annoying.
Would it be possible to compare byte by byte, or to use some utility that compares streams without loading them fully into memory?
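One bounded-memory option (not what this PR does) would be to keep writing the new content to the temp file as before and then compare the two files with the JDK's Files.mismatch (Java 12+), which compares with internal buffers and returns -1 when the files are byte-for-byte identical; a rough sketch with hypothetical names:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class CompareThenReplace {
	// Hypothetical flow: the new content was already written to tempFile;
	// only move it over the target when the bytes actually differ.
	static void replaceIfChanged(Path tempFile, Path target) throws IOException {
		if (Files.exists(target) && Files.mismatch(tempFile, target) == -1L) {
			Files.delete(tempFile); // identical: keep the target untouched
			return;
		}
		Files.move(tempFile, target, StandardCopyOption.REPLACE_EXISTING);
	}
}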

Contributor Author

See the related issue: even all files together are <12 MB on a big workspace. There is no need for memory optimizations.

Contributor

@mickaelistria this is a classical task for a RandomAccessFile, which can be opened as R/W: you read the bytes from the one stream, compare them to the ones read by the RandomAccessFile, and then either skip over them if they are the same or write the bytes. This only rewrites bytes that have changed. One must only take into account that RandomAccessFile is quite bare, so reading things buffered (e.g. with a 4096-byte buffer) will improve performance.

Contributor

@jukzi if you are hunting for seconds, a read/compare/write can save you even more, as you need to read the old state anyway; combining the compare with the write is more performant, and RandomAccessFile can even be memory-mapped if you like ;-)

Member

See the related issue: even all files together are <12 MB on a big workspace

Please define "big". We've seen .markers files with ~200 MB. Place them on NFS and have fun :).

Contributor Author

And on such a workspace, 200 MB won't bother you either.

Member

And on such a workspace, 200 MB won't bother you either.

Not sure what you mean. If I tell you that it is slow to read 200 MB and you say I shouldn't bother - what kind of argument is needed to explain that an extra read of 200 MB from NFS may double the time needed to write workspace data on a slow file system?

Member

BTW, you are trying to optimize a 1-second case and ignoring a possible regression of minutes.

Contributor Author

The memory size complained about was Java heap memory. If you are talking about file transfer time, you may be surprised that this PR can totally eliminate the network transfer: the file content is normally cached by the OS anyway, and in that case this PR prevents any read/write over the network if the content stays the same (which is the normal case).

Member

can totally eliminate the network transfer: the file content is normally cached by the OS anyway, and

"Can" and "normally" indicate that you are not sure that this will be the case, and of course no one can guarantee that.

What is however guaranteed is that there will be an OS that does not have caching enabled, or the cache will be too small, or a different virus scanner will be running than the one you've tested, and it will prevent the OS from caching the data because it wants to read/check every accessed file.

We are talking about N different operating systems with M versions and X different virus scanners with Y versions - and that is only if we have the data locally (client side); of course there might also be an NFS server side with a similar combination of components involved.

With that, I'm pretty sure that this PR will introduce a regression and therefore I'm against merging this PR "as is".

Contributor

github-actions bot commented Mar 11, 2024

Test Results

   636 files  ±0     636 suites  ±0   40m 0s ⏱️ - 3m 53s
 3 946 tests +1   3 923 ✅ +1   23 💤 ±0  0 ❌ ±0 
12 444 runs  +3  12 280 ✅ +3  164 💤 ±0  0 ❌ ±0 

Results for commit 9f3c426. ± Comparison against base commit 893349b.

♻️ This comment has been updated with latest results.

Member

@iloveeclipse iloveeclipse left a comment


Please don’t do an extra read operation.

@jukzi
Contributor Author

jukzi commented Mar 11, 2024

Please don’t do an extra read operation.

Calling for premature optimization? Those bulk reads/writes are highly optimized in current JDKs; any streaming may be slower.

@iloveeclipse
Member

Please don’t do an extra read operation.

Calling for premature optimization? Those bulk reads/writes are highly optimized in current JDKs; any streaming may be slower.

It is irrelevant how fast it is implemented in the JDK if the underlying NFS layer can only deliver 1 MB/s.
And no, it is not premature optimization: we have customers with big workspaces who use NFS, and we don't want to double the time needed to save the workspace.

@jukzi
Contributor Author

jukzi commented Mar 11, 2024

we don't want to double the time needed to save the workspace.

Then could you please test whether you are not accidentally blocking a PR that improves exactly that performance.

@iloveeclipse
Member

Then could you please test

I've always assumed the contributor who provides a change has to prove that the change does not harm.

I've pointed out a use case here (big workspaces, many changes, lots of data in the .markers file) and also pointed you to the possible regression. I have no time to properly validate every PR.

Now I would expect the contributor who wants to merge this change to try to reproduce and analyse the potentially bad performance for the use case I've provided.

If you want to merge or test it in the SDK without further analysis, I would prefer that you change the PR, add a new system property, and keep the behavior "as is" if the property is not set, so this change would not break others by default.

But just merging it based on an observation with small data on Windows, and optimizing it for the "virus scanner" use case while ignoring all others who don't work on small workspaces or on Windows, is not what I would expect.

@laeubi
Contributor

laeubi commented Mar 11, 2024

Just as a little proof of concept of how one can do a read/compare/write operation (using the local file system):

import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

public class FunWithRandomAccessFile {

	public static void main(String[] args) throws IOException {

		Path input = Path.of(args[0]);
		Path output = Path.of(args[1]);
		int blockSize = getBlockSize(output);
		byte[] inputBuffer = new byte[blockSize];
		byte[] outputBuffer = new byte[blockSize];
		try (InputStream inputStream = Files.newInputStream(input); RandomAccessFile ra = new RandomAccessFile(output.toFile(), "rw")) {
			// first truncate or expand the file; this makes sure we don't need bounds checks and is much more performant, as the OS will allocate only once...
			long newLength = Files.size(input);
			if (newLength != ra.length()) {
				ra.setLength(newLength);
				// position the file pointer at the first position
				ra.seek(0);
			}
			int read;
			// readNBytes returns 0 once the end of the stream is reached
			while ((read = inputStream.readNBytes(inputBuffer, 0, inputBuffer.length)) > 0) {
				ra.readFully(outputBuffer, 0, read);
				if (Arrays.equals(outputBuffer, 0, read, inputBuffer, 0, read)) {
					continue;
				}
				// we found a diff, now we need to reposition the stream...
				ra.seek(ra.getFilePointer() - read);
				// write the differing bytes...
				ra.write(inputBuffer, 0, read);
				// now we have two choices (maybe based on the file size...):
				// 1) we can simply continue, hoping that only parts of the file have changed
				// --> continue;
				// 2) we can assume that it is all messed up and simply write all remaining bytes without further checks...
				while ((read = inputStream.readNBytes(inputBuffer, 0, inputBuffer.length)) > 0) {
					ra.write(inputBuffer, 0, read);
				}
				break;
			}
		}
	}

	public static int getBlockSize(Path output) {
		try {
			long bs = Files.getFileStore(output).getBlockSize();
			if (bs < 10 * 1024) {
				return (int) bs;
			}
		} catch (IOException e) {
			// fall through to the default block size
		}
		return 4096;
	}
}

just read the whole file at once.
~200ms faster startup for platform workspace.
@jukzi
Contributor Author

jukzi commented Mar 11, 2024

@iloveeclipse
I don't like the idea that you block millions of users from having a faster Eclipse just because there is no proof that it is faster in your very uncommon setup. However, I have adapted the PR to remember the hash of the file content, so that the fast path is not taken when the content's hash did change. That avoids the additional file read that you fear.
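A rough sketch of that idea, with hypothetical names; the details of the actual PR may differ:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class HashRememberingWriter {
	private int rememberedHash; // hash remembered from the last read/write of this file
	private boolean hashKnown;

	// If the new content's hash differs from the remembered one, write directly
	// without any extra read; only on a hash match is the old file read once and
	// compared bytewise to rule out a collision.
	void write(Path target, byte[] newContent) throws IOException {
		int newHash = Arrays.hashCode(newContent);
		if (hashKnown && newHash == rememberedHash && Files.exists(target)
				&& Arrays.equals(Files.readAllBytes(target), newContent)) {
			return; // unchanged: do not touch the file
		}
		Files.write(target, newContent);
		rememberedHash = newHash;
		hashKnown = true;
	}
}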

@iloveeclipse
Member

I don't like the idea that you block millions of users from having a faster Eclipse just because there is no proof that it is faster in your very uncommon setup

I also don't like the idea that you are fine with introducing regressions, not considering that the change may cause trouble for users with a different setup. That is not nice and not inclusive, and it shouldn't be handled this way in the Eclipse platform.

The change would be fine if integrated into a custom application that only runs locally on Windows. But we are not a custom application that only runs on a local file system on Windows. You don't know whether there are millions of users working with NFS; this doesn't seem to be a very uncommon case to me.

to avoid read on save when content hash did not change (~50ms slower)
@jukzi jukzi force-pushed the SafeFileOutputStream_notouch branch from 441cb87 to 9f3c426 on March 11, 2024 15:37
@merks
Contributor

merks commented Mar 11, 2024

I find the tone of the communication here to be disturbing and kind of depressing. 😢 We live in a glass house. We ought to strive to remain constructive and open and, even in the face of frustration and disagreement, give others the benefit of the doubt that they too have the best interests of the project and the community in mind. Please try to be nice.

@laeubi
Contributor

laeubi commented Mar 11, 2024

I must confess that I fail to understand why a hash computation (and comparing hashes) is better than comparing the bytes in place: if we can compute the hash, it means we have read both (new and old) at least once, so at this "reading phase" one could trivially compare the bytes.

@jukzi
Contributor Author

jukzi commented Mar 11, 2024

Reading happens when you start Eclipse. Writing happens when you exit Eclipse; there can be a long time in between. Keeping only the hashes in memory is very heap friendly compared to storing the whole file content.

@laeubi
Contributor

laeubi commented Mar 11, 2024

Okay, I had missed that, but then why not use a safe hash, e.g. at least SHA-1? The usual hashCode is only meant to be, at best, evenly distributed, and equals performs the actual check.

Regardless of the algorithm used, one would usually use DigestInput/OutputStreams to compute the hash on-the-fly instead of after the fact.
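For reference, the on-the-fly variant would look roughly like this (illustrative sketch only; SHA-1 chosen to match the comment above):

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DigestWhileWriting {
	// The digest is updated while the bytes pass through the stream,
	// so no second pass over the data is needed.
	static byte[] writeAndDigest(Path target, byte[] content) throws IOException, NoSuchAlgorithmException {
		MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
		try (OutputStream out = new DigestOutputStream(Files.newOutputStream(target), sha1)) {
			out.write(content);
		}
		return sha1.digest();
	}
}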

@jukzi
Contributor Author

jukzi commented Mar 11, 2024

Even computing a simple hash is much slower (50 ms overhead in my benchmark) than reading all those bytes from the OS cache (which is just a mem copy). There is no need for a slow cryptographic hash that tries to ensure uniqueness.
Suggested reading: cryptographic hash vs fingerprint

@laeubi
Contributor

laeubi commented Mar 11, 2024

I just wanted to mention that hashCode is far from being collision free; a collision would mean you assume the file has not changed when it actually did. So one has to ask what is more important; even MD5 is much safer than hashCode. Also, being 'smart' and e.g. mixing in the length is often not a good idea, as you can't prove that your hash function doesn't now have even more collisions or bad properties.

I would simply use a compound key, (hash, length); in the write case you can then avoid computing the (expensive) hash altogether if the size already differs.
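A tiny sketch of such a compound key (hypothetical record; any hash function could be plugged in):

import java.util.Arrays;

// Compare the cheap length first, so the (possibly expensive) hash is only
// consulted when the lengths already match.
record ContentKey(int length, int hash) {

	static ContentKey of(byte[] content) {
		return new ContentKey(content.length, Arrays.hashCode(content));
	}

	boolean mayEqual(byte[] content) {
		return content.length == length && Arrays.hashCode(content) == hash;
	}
}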

@jukzi
Contributor Author

jukzi commented Mar 11, 2024

you assume the file has not changed when it actually did

In the already very rare case of a collision, the content is compared bytewise anyway to be sure. Nothing unsafe.

@jukzi jukzi dismissed iloveeclipse’s stale review March 12, 2024 11:26

extra read operation avoided.

@jukzi jukzi merged commit 8911e2d into eclipse-platform:master Mar 12, 2024
13 of 16 checks passed
@jukzi jukzi deleted the SafeFileOutputStream_notouch branch March 12, 2024 15:02
@laeubi
Contributor

laeubi commented Mar 13, 2024

I'm a bit irritated that this was merged without an approving review, as at least two committers have expressed concerns (with even one -1 review) and I don't see that the concerns have been addressed. Also, the fact that we now have 3 intermediate commits makes this quite hard to compare.

In particular, I don't see the concern of @iloveeclipse addressed here in any way. To explain this, I would just like to compare the previous state with the current one regarding memory:

For the reading case, previously only streams with a specific buffer size were used, so the memory for reading such a file had a fixed limit. Now there is no limit, and the JVM needs to allocate an array of the size of the file (for the case mentioned by @iloveeclipse, 200 MB).

Additionally, for these 200 MB we now need to compute the hash, which adds a runtime overhead (even though we hope it is outweighed by the time saved when an unmodified file is not written, but see below).

For the writing case (and let's assume the file didn't change, as otherwise this change won't make any sense at all): again, previously only streams with a fixed-size buffer were used; now it needs twice the size of the file (in @iloveeclipse's case 400 MB) plus an additional hash computation over the 200 MB file, so we have a double hash-computation penalty (once on read, once on write) and double the memory consumption.

I'll try to mitigate some of these things, but this has definitely not addressed the concerns about big files and has simply ignored all hints to make it more performant, and whether it improves performance seems to depend on whether the double hashing is faster than the file write (which might depend on the OS and the hard drives / SSDs / ... used).

@merks
Contributor

merks commented Mar 13, 2024

I too must express my dismay. I had planned to review this in detail to understand better what everyone is trying to say from a technical perspective. Unfortunately this is SimRel release week, so for me it's kind of the week from hell, with yesterday and today being fully loaded. I had expected to have more time for a review, and I did not expect committing the change as-is to be an urgent need.

As I see it, something drastically needs to change in the communication and the attitudes here. The behavior I see is simply unacceptable, and it is beneath even the lowest standard of what I expect from the highly skilled professionals involved here. I see frustration boiling over and being vented in open expressions of hostility directed in a very personal way. Things just escalate at this emotional level, leaving all the technical issues in the dust and forgotten. The end result of this poor process is a sub-optimal commit. When children play like this, we call for a time-out and we separate them. We try to avoid taking sides because of course all problems are someone else's fault. But, as they say, it takes two to tango.

@mickaelistria
Contributor

@jukzi for more controversial changes like this one (which triggered some discussion very early) that are not blockers, please be patient and leave more time for review before merging. Leaving more time for discussion will either consolidate a consensus or reveal some important issues; both are a win from a project perspective.
I understand that merging sometimes feels relieving, as it makes our work feel complete, but we must refrain from obeying such a need for immediate relief if there is still some longer-term discussion going on that can challenge the proposal.
In case of a strong disagreement between committers, merging is not the best solution. The project leads are here to help with taking the best immediate decision and hopefully to facilitate the discussion.

@akurtakov
Member

@eclipse-platform/eclipse-platform-project-leads - time to step up

@iloveeclipse
Member

@jukzi : please don't do this anymore without giving reviewers a chance to review the PR again:
[screenshot]

Additional concerns on the submitted PR:

  1. The original behavior of a "Safe" file-based stream with backup was removed, see below.
  2. The original code read the backup file on startup if it existed (meaning Eclipse was not properly terminated). Now it seems to ignore the existing temp file and starts with empty data (if I read it right). This would mean workspace changes are lost after an Eclipse crash. The original code applied not-yet-committed temporary workspace changes to the workspace on the next startup; if this is now broken, the behavior must be changed to be "safe" again.
  3. The memory consumption of low-memory Eclipse configurations might be heavily affected by the original patch, as it uses a memory buffer instead of a file stream for storing data (that is addressed by "Restore the stream nature of read operation in SafeFileInputStream" #1247). Workspaces with many changes over a long period of time might see memory issues on shutdown now.
  4. Related comments in the code were not updated, resulting in a mismatch between the documented strategy and the actual code behavior.

Regarding the PR merge:

  1. Others and I expressed various concerns, and they were only partly addressed.
  2. The review was dismissed without giving the original reviewer a chance to re-review.
  3. No other reviewer had a chance to review the updated PR before the merge.
  4. The PR touches an essential part of the workspace data storage code; it is not trivial in any way.
  5. There is no technical consensus reached yet.
  6. I see additional issues with this PR that have not yet been discussed; I will comment on them later.

I was asked to act as a project lead.

  1. I would propose & merge a PR now that reverts all related commits in question.
  2. I would ask @jukzi to submit a new PR with the changes he wants to merge, considering the additional feedback.
  3. I would expect the new PR to need a positive review from at least one other committer (other than @jukzi) who has expressed concerns on the original change.
  4. I would expect that in the future, PRs on which no common agreement has been reached would not be merged.

@iloveeclipse iloveeclipse mentioned this pull request Mar 13, 2024
@jukzi
Contributor Author

jukzi commented Mar 13, 2024

I am concerned about the obvious lies. You did have the mandated time to re-review; you were asked to do so. You just did not. You did have the time to test. You even announced that you would not. Reading the backup file on startup still exists; it has not been changed.
All concerns have been heard and addressed as well as possible.
From a technical perspective I see no regressions, nor possibilities to improve further, as the hotspots have now totally shifted from I/O to encoding/decoding, and the memory consumption is less than pressing sort in the Problems view.
Regarding the interpersonal side: I am just frustrated about the ongoing disrespect towards contributions (not only this one) which are hard work, including tests and measurements. If you want to review, please take the time to make serious reviews, including proofs, tests, and responses to answers. I think the collaboration could be relaxed by meeting in person, which all of you refused :-(
With the current disrespectful behavior I see no way for further cooperation on this topic.

@iloveeclipse
Member

I am concerned about the obvious lies.

Jörg, let us discuss technical issues, and not go down to this style of communication. This is not OK.

I think we all want the best for Eclipse, but we need to communicate in a way that doesn't offend anyone, and if this communication doesn't work, we should try to understand what prevents us from being successful.

Please give reviewers a chance to discuss the change; that way we could avoid misunderstandings about what the change is supposed to do. That's what the review is needed for - we all need to understand the change, and if something is unclear or could be done differently, discuss it. If concerns were expressed, one should not merge the PR anyway.

@SarikaSinha
Member

Based on the concerns raised, I suggest making this PR's behavior an option, so that interested users can use and explore it, and after we have gained confidence we can make it the default option in the next release.

@jukzi
Contributor Author

jukzi commented Mar 13, 2024

Technical concerns can be valid or invalid, solved or unsolved, proven or unproven, tested or untested, important or unimportant, helpful or ill-minded. Don't mix them all together.
I am always open to constructive reviews, real issues, solving regressions, and reverting if technically needed. So if anybody can show a regression, please do. You can further improve the code? Please do!

@iloveeclipse
Member

@jukzi : let's restart the technical discussion on a new PR,
so everyone has a chance to review your code and reach technical agreement.
