Skip to content

Conversation

@neykov
Copy link
Member

@neykov neykov commented Nov 21, 2016

File access locks for exactly 30 seconds in Virtualbox.

FileBasedObjectStoreTest.assertFilePermission600(file);
}

// Fails ~3 times on 5000 runs in Virtualbox (Ubuntu Xenial).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth a comment here that the test is intended to illustrate the problem, which is now addressed by the change in RebindTestUtils, so though this is @broken it isn't a test that someone needs to fix?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it would certainly be nice to have an explanation of why it's happening. I am still not ready to attribute it to VB flakiness.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth cross-referencing the comment you added in RebindTestUtils.TIMEOUT.

Other than that, LGTM and good to merge.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment a couple of line below.

File access locks for exactly 30 seconds in Virtualbox.
@neykov neykov force-pushed the fix/persistence-test-failure branch from a9fb2e9 to c56e203 Compare November 22, 2016 12:53
@asfgit asfgit merged commit c56e203 into apache:master Nov 22, 2016
asfgit pushed a commit that referenced this pull request Nov 22, 2016
@neykov neykov deleted the fix/persistence-test-failure branch November 22, 2016 16:49
@neykov
Copy link
Member Author

neykov commented Nov 22, 2016

Did some more tests. Works fine on Amazon. Works fine in KVM. Works fine in Virtualbox on an ext4 fs backed by a ramdisk.
When tracing with strace the blockage looks like:

[pid  6215] 16:16:11.362069 <... close resumed> ) = 0 <0.000125>
[pid  6214] 16:16:11.362093 <... close resumed> ) = 0 <0.000139>
[pid  6215] 16:16:11.362385 rename("/tmp/FileWriteSimpleTest$FileWriterTask.415358743301600237.txt", "/tmp/FileWriteSimpleTest$FileWriterTask.415358743301600237.txt.renamed" <unfinished ...>
[pid  6215] 16:16:11.362454 <... rename resumed> ) = 0 <0.000060>
[pid  6215] 16:16:11.362520 unlink("/tmp/FileWriteSimpleTest$FileWriterTask.415358743301600237.txt.renamed" <unfinished ...>
[pid  6214] 16:16:11.362579 rename("/tmp/FileWriteSimpleTest$FileWriterTask.4885590662355799208.txt", "/tmp/FileWriteSimpleTest$FileWriterTask.4885590662355799208.txt.renamed") = 0 <0.000015>
[pid  6214] 16:16:11.362622 unlink("/tmp/FileWriteSimpleTest$FileWriterTask.4885590662355799208.txt.renamed" <unfinished ...>
[pid  6215] 16:16:11.363172 <... unlink resumed> ) = 0 <0.000646>
[pid  6214] 16:16:42.128922 <... unlink resumed> ) = 0 <30.766277>

It means that the rename call is stuck in kernel land for 30 seconds!. That's confirmed by trying to CTRL-C the application which doesn't do anything.

Given that I'm not able to reproduce in another environment I'll attribute it to Virtualbox bugs. But why on rename/delete and not on other disk activity?
I'm testing with latest Virtualbox, xenial image, additions.

I followed this SO thread and timed my block device activity. The results are:

            ALL           MIN           AVG           MAX           N
--------------- ------------- ------------- ------------- -----------

Q2Q               0.000000004   0.015033531  14.832239223        3091
Q2G               0.000000305   0.000001230   0.000125990        6132
G2I               0.000000250   0.010199140  31.276769918        6134
Q2M               0.000000124   0.000000225   0.000001158          52
I2D               0.000000281   0.000000682   0.000038120        6134
M2D               0.000002187   0.000005093   0.000009806          52
D2C               0.000033907   0.000089936   0.011638723        3092
Q2C               0.000036494   0.010208551  31.288416282        3092

Given that the the 30 second delay is clearly visible in the above table its obvious that it's the disk device causing the delays and not some lock in the kernel. I'll leave it at that (though there's the nagging feeling that Ubuntu Wily is working just fine :) ).

@neykov
Copy link
Member Author

neykov commented Nov 22, 2016

For the record I'm adding the test case here. It's the plain java equivalent of the merged test.


import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Collection;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FileWriteSimpleTest {
    private static final int THREAD_COUNT = 2;
    private static final long FILE_OPERATION_TIMEOUT_MS = 20000;
    private static final String TEST_FILE_CONTENT = generateContent();
    private static final ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);

    private static class FileWriterTask implements Callable<FileWriterTask> {
        @Override
        public FileWriterTask call() throws Exception {
            try {
                File tmp = File.createTempFile(getClass().getName() + ".", ".txt");
                try(Writer out = new FileWriter(tmp)) {
                    out.write(TEST_FILE_CONTENT);
                }
                File dst = new File(tmp.getParentFile(), tmp.getName() + ".renamed");
                if (!tmp.renameTo(dst)) {
                    throw new IOException("Couldn't rename file " + tmp + " to " + dst);
                }
                if (!dst.delete()) {
                    throw new IOException("Couldn't delete file " + tmp);
                }
                return this;
            } catch (IOException e) {
                throw new IllegalStateException(e);
            }
        }
    };
    
    public void testSimpleOperationsDelay() throws Exception {
        Collection<Future<FileWriterTask>> tasks = new ArrayList<>(THREAD_COUNT);
        for (int i = 0; i < THREAD_COUNT; i++) {
            tasks.add(executor.submit(new FileWriterTask()));
        }
        long timeStart = System.currentTimeMillis();
        for (Future<FileWriterTask> task : tasks) {
            task.get();
        }
        long timeEnd = System.currentTimeMillis();
        long timeSpent = timeEnd - timeStart;
        if (timeSpent > FILE_OPERATION_TIMEOUT_MS) {
            throw new IllegalStateException("Timed out waiting for tasks - " + timeSpent);
        }
    }
    
    public void testSimpleOperationsDelayManyTimes() throws Exception {
        try {
            for (int i = 1; i <= 5000; i++) {
                if (i % 100 == 0) {
                    System.out.println("Iteration " + i);
                }
                testSimpleOperationsDelay();
            }
        } finally {
            executor.shutdown();
        }
    }

    private static String generateContent() {
        final char[] charArray = new char[4096];
        java.util.Arrays.fill(charArray, ' ');
        return new String(charArray);
    }

    public static void main(String[] args) throws Exception {
        new FileWriteSimpleTest().testSimpleOperationsDelayManyTimes();
    }

}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants