Skip to content

Sync file system on temp file before moving it.#791

Merged
asfgit merged 2 commits intoapache:masterfrom
geomacy:brooklyn-526
Sep 11, 2017
Merged

Sync file system on temp file before moving it.#791
asfgit merged 2 commits intoapache:masterfrom
geomacy:brooklyn-526

Conversation

@geomacy
Copy link
Contributor

@geomacy geomacy commented Aug 4, 2017

See https://issues.apache.org/jira/browse/BROOKLYN-526.

This is a suggestion for a fix. Not sure how to go about testing this!

@geomacy
Copy link
Contributor Author

geomacy commented Aug 4, 2017

Interesting;

Test Result (2 failures / -15)
org.apache.brooklyn.policy.loadbalancing.LoadBalancingPolicySoakTest.testLoadBalancingManyItemsQuickTest
org.apache.brooklyn.policy.loadbalancing.LoadBalancingPolicySoakTest.testLoadBalancingQuickTest

These run ok on my laptop. This may be turning up because of the comparatively restricted capability of the build machine. Retest this please.

@aledsage
Copy link
Contributor

aledsage commented Aug 4, 2017

@geomacy Very cool - I didn't know about that!

Looking at the javadoc (https://docs.oracle.com/javase/8/docs/api/java/io/FileDescriptor.html#sync--):

Throws SyncFailedException - Thrown when the buffers cannot be flushed, or because the system cannot guarantee that all the buffers have been synchronized with physical media.

That suggests there might be two causes:

  1. "cannot guarantee" because of limitations of the physical media? (we should catch + continue); or
  2. io problems (e.g. a network drive that is temporarily unavailable - we should retry, with exponential backoff perhaps?

I guess propagating the exception (as you're doing) without executing move is probably right - we don't want to replace the old file with a file that might not have all the data in it.

Do we go with this code, because it's an improvement; and then later come back to (2) for retries?
Or do we worry about (1), where this might entirely break on some OSs/disks?

Thoughts?

@geomacy geomacy force-pushed the brooklyn-526 branch 3 times, most recently from 47ce4c1 to b3f3ce7 Compare August 9, 2017 11:22
@geomacy
Copy link
Contributor Author

geomacy commented Aug 9, 2017

@aledsage very good questions!

I agree it's best to propagate the exception if it happens - if it's due to a scenario like (1) where the FS can't guarantee the sync, then it's the wrong filesystem for doing persistence. A retry is probably a good idea - have added one in latest commit, deliberately keeping it very simple, to avoid possibly introducing other issues. Let me know what you think.

I had a look at testing this but can't think of a way to do it easily. You'd either have to have a file system that you could force to fail sync on demand (something using FUSE?), or mock out the sync itself - I looked into using a dynamic proxy, but FileDescriptor is a static final class, so not suitable for proxying. Would be glad of any suggestions.

@aledsage
Copy link
Contributor

@geomacy not convinced it's worth the effort of trying to mock the FileDescriptor to simulate a SyncFailedException. If we were doing more than retry + propagate, then maybe it would. But the current logic is pretty simple.

One can mock final classes with Mockito v2 (we're on v2.7.12 currently): https://stackoverflow.com/a/40018295/1393883, but you need to add a config file to change the MockMaker.

I think this is good as-is.

Have you tried running the integration tests, as well as the unit tests?

@geomacy
Copy link
Contributor Author

geomacy commented Aug 16, 2017

I haven't run the integration tests, I shall try to make some time to do so.

@geomacy
Copy link
Contributor Author

geomacy commented Aug 18, 2017

rebased against master, still haven't got round to integration tests

@geomacy
Copy link
Contributor Author

geomacy commented Sep 7, 2017

@aledsage finally got round to checking this against the integration tests.

It doesn't seem to have caused any significant problems - with a run of the tests against current master I got 7 failures:

grep "^<testng-results" {core,camp/camp-brooklyn,launcher}/target/surefire-reports/testng-results.xml
core/target/surefire-reports/testng-results.xml:<testng-results skipped="0" failed="3" ignored="2230" total="4492" passed="2259">
camp/camp-brooklyn/target/surefire-reports/testng-results.xml:<testng-results skipped="0" failed="3" ignored="755" total="806" passed="48">
launcher/target/surefire-reports/testng-results.xml:<testng-results skipped="0" failed="1" ignored="81" total="101" passed="19">

With this PR I actually got fewer:

grep "^<testng-results" {*,*/*}/target/surefire-reports/testng-results.xml | fgrep -v 'failed="0"'
test-framework/target/surefire-reports/testng-results.xml:<testng-results skipped="0" failed="1" ignored="96" total="119" passed="22">
camp/camp-brooklyn/target/surefire-reports/testng-results.xml:<testng-results skipped="0" failed="3" ignored="753" total="804" passed="48">

Think this should be benign enough to merge.

@ahgittin
Copy link
Contributor

ahgittin commented Sep 7, 2017

@geomacy @aledsage I hadn't realized this either; javadocs and online searches suggest the fd.sync() may be needed.

i'm curious whether this really is the issue -- at least on Mac -- as i've done some testing and the sync seems to be very fast whereas write is what takes the time. this is true even under contention: writes contend with each other (8 files written concurrently take 8+ times as long as writing 1 file; if the OS cache is in play i'd expect CPU contention which with 4 cores we'd expect 8 files to take twice as long, which is what happens if we do a System.arraycopy instead of a write) and if there's background disk activity it also slows down substantially. numbers and code below.

worth having this i agree, based on what i've read, though i have one worry -- if sync is done after the close, i always get the SyncFailedException; and it looks like the code in this PR is synching after close. we might want to understand that before merging?

i also came across the code at https://github.com/apache/activemq-apollo/blob/trunk/apollo-util/src/main/scala/org/apache/activemq/apollo/util/IOHelper.java#L272 : here to get sync they included some custom JNI calls. this makes me wonder whether -- although things acted nicely in my tests -- there aren't actually any strong guarantees with java+mac, and you need that type of low-level library?

although another part of me wonders if https://issues.apache.org/jira/browse/BROOKLYN-526 was in fact caused by something completely different (disk corruption?).


one thread:

t0 took 1.390 seconds to write 1,048,576,000 bytes to file
t0 took 0.010 seconds to sync
t0 took 0.000 seconds to close

8 threads:

t7 took 13.180 seconds to write 1,048,576,000 bytes to file
t7 took 0.099 seconds to sync
t7 took 0.000 seconds to close

code:

public class FileWrite {

    public static void main(String[] args) throws Exception {
        write("warmup");
        int NUM_THREADS = 8;
        for (int x=0; x<NUM_THREADS; x++) {
            final String id = "t"+x;
            new Thread() {
                public void run() {
                    try {
                        write(id);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }.start();
        }
    }
    
    public static void write(String id) throws Exception {
        Stopwatch t = Stopwatch.createStarted();
        
        File file = File.createTempFile("deleteme", ".dat"); // put your file here.
        System.out.printf("%s took %.3f seconds to create %s%n", id, t.elapsed(TimeUnit.NANOSECONDS) / 1e9, file.getAbsolutePath()); t.reset(); t.start();
        
        FileOutputStream fos = new FileOutputStream(file);
        byte[] bytes = new byte[32 * 1024];
        // byte[] bytes2 = new byte[32 * 1024];
        for (long l = 0; l < 1000 * 1024 * 1024; l += bytes.length) { 
            fos.write(bytes);
            // System.arraycopy(bytes, 0, bytes2, 0, 32 * 1024);
        }
        System.out.printf("%s took %.3f seconds to write %,d bytes to file%n", id, t.elapsed(TimeUnit.NANOSECONDS) / 1e9, file.length()); t.reset(); t.start();
        
        fos.getFD().sync();
        System.out.printf("%s took %.3f seconds to sync%n", id, t.elapsed(TimeUnit.NANOSECONDS) / 1e9); t.reset(); t.start();
        
        fos.close();
        System.out.printf("%s took %.3f seconds to close%n", id, t.elapsed(TimeUnit.NANOSECONDS) / 1e9); t.reset(); t.start();
        
        File file2 = File.createTempFile("deleteme", ".dat"); // put your file here.
        java.nio.file.Files.move(file.toPath(), file2.toPath(), StandardCopyOption.ATOMIC_MOVE);
        System.out.printf("%s took %.3f seconds to move to %s%n", id, t.elapsed(TimeUnit.NANOSECONDS) / 1e9, file2.getAbsolutePath()); t.reset(); t.start();
    }
    
}

@geomacy
Copy link
Contributor Author

geomacy commented Sep 8, 2017

You're right @ahgittin, the sync should be done before the close; I have added another commit to do this.

I agree I'm not sure that the problem was caused by a failure to commit data to disk when @aledsage's laptop crashed, but it's certainly plausible. That's an interesting link you shared above. I've done a bit more digging and come across a very interesting email here:

https://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html

It's definitely worth reading; in short, it looks like just adding the sync() in Brooklyn, while it may help, isn't in itself enough to ensure the data is committed to disk on pretty much any system, not just OSX. :-(

In fact the JNI call that you referred to on OSX seems to be the only way to guarantee the data is committed, and of course it's only available on the Mac.

I think this PR may still be something worth doing (it will eliminate one failure point, by ensuring the data is flushed from host memory to the disk device). I don't know what else we might want to do to take this further, would be interested to hear what you think.

@ahgittin
Copy link
Contributor

ahgittin commented Sep 8, 2017

very interesting link @geomacy - the stronger guarantee of fcntl() probably what the activemq jni code does.

i think this PR is good and your fix here for the problem caused by the also-needed #809 looks good. i think merge though let's give @aledsage a few days to respond.

as for any remaining risk ... the documentation for sync is unambiguous (and the post is old) so i tend towards thinking this does what the API says, synching data with the disk. but let's remember this issue is here in case we come across systems where the OS/disk needs additional configuration to ensure sync does what it says it does.

@aledsage
Copy link
Contributor

LGTM - merging now.

@asfgit asfgit merged commit c5cbc08 into apache:master Sep 11, 2017
asfgit pushed a commit that referenced this pull request Sep 11, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants