(WIP) Optimizing search for start and end of corrupt blocks when a corrupt block needs to be created #361

Closed
wants to merge 1 commit

Conversation

n3nash (Contributor) commented Mar 22, 2018

Current algorithm
inputStream.seek() advances one byte at a time, sliding the read window byte by byte. This requires a seek(..) and a readFully(..) over the FSDataInputStream for every single byte.
New algorithm
inputStream.readFully(some_sample_block_size) reads a chunk into memory, and the sliding-window search for the magic header runs over the in-memory byte array rather than over the input stream.
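
For illustration, a minimal sketch of the in-memory sliding-window search, assuming the chunk has already been read into a byte array with readFully(..) (the helper below is illustrative, not the exact code in this PR):

class MagicScanSketch {
  // Scan an in-memory chunk for the magic header instead of issuing a
  // seek(..)/readFully(..) against the FSDataInputStream for every byte.
  static int indexOfMagic(byte[] chunk, int validLength, byte[] magic) {
    for (int start = 0; start + magic.length <= validLength; start++) {
      int matched = 0;
      while (matched < magic.length && chunk[start + matched] == magic[matched]) {
        matched++;
      }
      if (matched == magic.length) {
        return start; // offset of the next block's magic header within the chunk
      }
    }
    return -1; // no magic header found in this chunk
  }
}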

Test Results
The earlier algorithm takes approximately 1-2 ms per byte, since every byte requires a seek(..) and a readFully(..) to compare against the magic header. For a 256 MB corrupt block, finding the next data block takes ~256,000,000 ms, which equates to roughly 71 hours; so whenever a corrupt block has been written, the job effectively takes forever.

The new algorithm reads the entire 256 MB worth of bytes into memory using readFully(..) and then slides over the byte array to find the next data block. **This completes in < 5 secs.**

This came up during a performance test; an example log is below:

18/03/16 21:34:58 INFO collection.DiskBasedMap: Spilling to file location ...
18/03/16 21:34:58 INFO log.HoodieCompactedLogRecordScanner: Scanning log file HoodieLogFile {some.log.1}
18/03/16 21:34:58 INFO log.HoodieLogFileReader: Log HoodieLogFile {somelog.1} has a corrupted block at 14
18/03/16 23:00:55 ERROR executor.CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver disassociated! Shutting down.

Notice that the task ran for about 1.5 hrs before it timed out.

@n3nash n3nash changed the title Optimizing search for start and end of corrupt blocks when a corrupt block needs to be created (WIP) Optimizing search for start and end of corrupt blocks when a corrupt block needs to be created Mar 22, 2018
@n3nash n3nash force-pushed the corrupt_block_optimizations branch from ddd5719 to 408e327 Compare March 22, 2018 05:39
n3nash (Contributor, Author) commented Mar 22, 2018

@vinothchandar Please take a pass at this PR too tomorrow when you look over the other 2.

@n3nash n3nash force-pushed the corrupt_block_optimizations branch from 408e327 to 9c49a9f Compare March 22, 2018 05:44
@n3nash n3nash changed the title (WIP) Optimizing search for start and end of corrupt blocks when a corrupt block needs to be created Optimizing search for start and end of corrupt blocks when a corrupt block needs to be created Mar 22, 2018
vinothchandar (Member) left a comment:

Need to spend more time on reviewing this.

.convert(SchemaUtil
.readSchemaFromLogFile(HoodieCLI.tableMetadata.getFs(), new Path(logFilePath)));
} catch(NullPointerException e) {
// unable to read schema
Member:

Can we handle this more directly with null checks as needed? The empty catch block for the NPE could be cleaner.

Contributor (Author):

Yeah, just did that to fix it quickly, cleaned it now.

@@ -49,6 +48,7 @@
class HoodieLogFileReader implements HoodieLogFormat.Reader {

private static final int DEFAULT_BUFFER_SIZE = 4096;
private static final int DEFAULT_LOG_BLOCK_SIZE = 256*1024*1024;
Member:

Should the default be this high? This will also add to the memory pressure, correct?

Contributor (Author):

The 256 MB is just read once as a byte[] and then discarded, so it should be OK I think.

boolean done = true;
// read upto logblocksize to find next magic header
do {
corruptedBytes = new byte[logBlockSize];
Member:

Instead of handling this manually, should we just wrap this in a BufferedInputStream?

https://docs.oracle.com/javase/7/docs/api/java/io/BufferedInputStream.html

Member:

I see this is a DataInputStream.. still, is there some standard class that can give you the buffering for free?

Contributor (Author):

Yeah, that's the reason I didn't use BufferedInputStream, but since you mentioned it I looked again and realized that DataInputStream actually extends FilterInputStream! So I was able to change this :)

} catch (EOFException e) {
// in case unable to read logBlockSize worth of bytes
inputStream.seek(currentPos);
numberOfBytesRead = inputStream.available();
Member:

Why not use available() to estimate the actual bytes to begin with, instead of course-correcting in the catch block?

Contributor (Author):

The reason for doing this was the contract of available(..) described here https://docs.oracle.com/javase/7/docs/api/java/io/FilterInputStream.html#available() and the fact that we want to read a minimum of the default size. But I changed this to use the BufferedInputStream, so probably no need to discuss further..

@n3nash n3nash force-pushed the corrupt_block_optimizations branch from 9c49a9f to 650fc3d Compare March 24, 2018 00:15
@n3nash n3nash force-pushed the corrupt_block_optimizations branch from 650fc3d to 0c4c0a4 Compare March 24, 2018 04:18
n3nash (Contributor, Author) commented Mar 25, 2018

@vinothchandar addressed your comments, please take another pass.

byte[] corruptedBytes;
int corruptedBlockSize = 0;
boolean done = true;
BufferedInputStream bufferedInputStream = new BufferedInputStream(inputStream);
Contributor (Author):

We can wrap this whole block in a CustomBufferedReader sort of implementation and just return the corruptedBytesSize on termination for cleanliness...

Member:

Not sure if I fully follow.. But it would be good to have a Buffered, seekable abstraction over the log file.. Corrupt block handling etc should be left at this level IMO

Member:

Let's file a task for this?

@@ -97,7 +97,8 @@
private final static Logger log = LogManager.getLogger(WriterBuilder.class);
// Default max log file size 512 MB
public static final long DEFAULT_SIZE_THRESHOLD = 512 * 1024 * 1024L;

// Default max log block size 512 MB
public static final int DEFAULT_LOG_BLOCK_SIZE_THRESHOLD = 256 * 1024 * 1024;
Member:

The actual value seems to be 256 MB.

Contributor (Author):

fixed

Member:

The comment still says 512 MB.. and also, is 256 MB too high? What about the additional memory needs?

@@ -125,6 +128,7 @@ private HoodieLogBlock readBlock() throws IOException {

// 2 Read the total size of the block
blocksize = inputStream.readInt();
this.logBlockSize = blocksize;
Member:

Can we make the logBlockSize part of the LogBlock itself?

Member:

Ping..

do {
// read upto logblocksize to find next magic header
corruptedBytes = new byte[logBlockSize];
int numberOfBytesRead = bufferedInputStream.read(corruptedBytes, 0, logBlockSize);
Member:

If you are buffering already, why do we have to issue such a large read? If you iterate byte-by-byte like before on a buffered reader, do you still have the same issue?

Member:

I know there probably isn't a seek() to go back and forth.. mark (https://docs.oracle.com/javase/7/docs/api/java/io/BufferedInputStream.html#mark(int)) should be helpful there to rewind, no?
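
A rough sketch of that mark()/reset() idea, assuming a BufferedInputStream over the log stream (peekForMagic and its parameters are hypothetical names, not code from this PR):

import java.io.BufferedInputStream;
import java.io.IOException;
import java.util.Arrays;

class MarkResetSketch {
  // Peek ahead for the magic header; if it is not there, reset() rewinds within the
  // buffer (no new HDFS fetch) and skip(1) slides the window forward by one byte.
  // A single read(..) is assumed to fill the peek buffer here, for brevity.
  static boolean peekForMagic(BufferedInputStream in, byte[] magic) throws IOException {
    in.mark(magic.length); // remember the current position; the readlimit covers the peek
    byte[] candidate = new byte[magic.length];
    int read = in.read(candidate, 0, candidate.length);
    if (read == magic.length && Arrays.equals(candidate, magic)) {
      return true; // header found at the current position
    }
    in.reset(); // rewind to the marked position
    in.skip(1); // advance one byte and let the caller try again
    return false;
  }
}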

Member:

Ping.

done = false;
}
corruptedBlockSize += estimatedCorruptBlockSize;
} while (!done);
Member:

So we return 1 corrupt block if there are back-to-back corrupt blocks?

Contributor (Author):

Yes; ideally there shouldn't be back-to-back corrupt blocks, but this covers the case where the writer failed to write the magic header for the next block too..

private long scanForNextAvailableBlockOffset() throws IOException {
while (true) {
long currentPos = inputStream.getPos();
private long scanForNextAvailableBlockOffset(byte[] bytes, int numberOfByesRead)
Member:

typo: numberOfBytesRead

Contributor (Author):

fixed

int numberOfBytesRead = bufferedInputStream.read(corruptedBytes, 0, logBlockSize);
int estimatedCorruptBlockSize = (int) scanForNextAvailableBlockOffset(corruptedBytes,
numberOfBytesRead);
if (numberOfBytesRead == logBlockSize && estimatedCorruptBlockSize == numberOfBytesRead) {
Member:

This seems to be checking: we were able to read a block size worth of bytes, and the bytes after that matched up to a next available block offset?

// No luck - advance and try again
inputStream.seek(currentPos + 1);
Member:

would BufferedInputStream with mark and reset have helped resolve this issue differently?

return false;
} catch (EOFException e) {
// We have reached the EOF
return true;
}
}

private int readMagic(byte[] bytes, int offset) throws IOException {
Member:

javadocs on what this method is returning

Member:

Ping

bvaradar (Contributor) left a comment:

I looked at FSDataInputStream (and the underlying DFSInputStream) to understand why you are seeing very high latency per byte (amortized). The culprit is the backwards seek we are doing in readMagic and scanForNextAvailableBlockOffset. The rewind implementation in DFSInputStream clears out the current (HDFS) block reader, and the next read() (of 1 byte) causes the HDFS data block to be fetched again.
DFSInputStream does the correct buffering for sequential reading, though (reading block by block).

So, lessons for us:
Don't use a naked DFSInputStream or FSDataInputStream (created by DistributedFileSystem.open()). Change HoodieWrapperFileSystem.open() to make sure the FSDataInputStream wraps a BufferedFSInputStream, which in turn wraps the DFSInputStream. This way we can (a) buffer and (b) avoid a refetch when rewinding by a few bytes (sketched below).
(or)
Change the scan logic in HoodieLogReader to not rewind the input stream but to keep moving forward.

Also, it looks like DFSInputStream maintains read statistics. We can track them to correlate and debug latency hits.

Since it is good to fix the root cause, can we change the logic to do buffering by default for all FileSystem.open() calls instead of turning on buffering only for corrupt blocks? This way, we will not encounter similar issues if we introduce new logic that requires rewinding file pointers.
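
A minimal sketch of the first option, assuming Hadoop's BufferedFSInputStream and FSDataInputStream APIs (openBuffered and bufferSize are illustrative names, not the actual HoodieWrapperFileSystem change):

import java.io.IOException;
import org.apache.hadoop.fs.BufferedFSInputStream;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class BufferedOpenSketch {
  // Open the path so reads go through BufferedFSInputStream: sequential reads are
  // buffered and small backwards seeks stay within the buffer instead of forcing
  // DFSInputStream to drop its block reader and refetch the HDFS block.
  static FSDataInputStream openBuffered(FileSystem fs, Path path, int bufferSize) throws IOException {
    FSDataInputStream raw = fs.open(path);
    // For a DistributedFileSystem the wrapped stream is a DFSInputStream, which
    // extends FSInputStream, so it can be re-wrapped here (assumption for other FS types).
    return new FSDataInputStream(
        new BufferedFSInputStream((FSInputStream) raw.getWrappedStream(), bufferSize));
  }
}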

@@ -125,6 +128,7 @@ private HoodieLogBlock readBlock() throws IOException {

// 2 Read the total size of the block
blocksize = inputStream.readInt();
this.logBlockSize = blocksize;
Contributor:

So, this.logBlockSize will be set to the block size of the last uncorrupted block, right?
In that case, createCorruptedBlock() will be using this buffer size while reading the corrupted block. Is this a heuristic to guess how much data you want to read ahead?

Contributor (Author):

Yes, you are right and this is the DEFAULT_LOG_BLOCK_SIZE.

@@ -61,6 +62,7 @@
private long reverseLogFilePosition;
private long lastReverseLogFilePosition;
private boolean reverseReader;
private int logBlockSize;

HoodieLogFileReader(FileSystem fs, HoodieLogFile logFile, Schema readerSchema, int bufferSize,
boolean readBlockLazily, boolean reverseReader) throws IOException {
Contributor:

Important Note: The buffer size passed here is never used when constructing DFSInputStream.

Contributor (Author):

It's used in the next line?

Contributor:

No, I meant the HDFS client implementation (DFSClient.open(), called via FS.open()), which is supposed to use the bufferSize, drops it. The implementation uses the HDFS block read from the network as an implicit buffer.

@@ -125,6 +128,7 @@ private HoodieLogBlock readBlock() throws IOException {

// 2 Read the total size of the block
blocksize = inputStream.readInt();
this.logBlockSize = blocksize;
Contributor:

Not related to this change but important to fix: in line 137, we are catching all exceptions and treating them as a corrupt block. The underlying inputStream could throw an IOException (other than EOFException) because of transient failures (or the stream getting closed), and those don't mean the block is corrupt. These IOException cases need to be raised to the caller instead of returning a corrupt block.

Contributor (Author):

Good point, fixed it.

n3nash (Contributor, Author) commented Apr 3, 2018

@bvaradar thanks for digging into the root cause and great analysis! I looked at the code too and you're right, the rewind implementation in DFSInputStream results in clearing out the current block reader. I made some local code changes and tried using BufferedFSInputStream; it indeed works much better, but still not as performant as the prefetching and buffering implemented in this PR (I'm wondering what the difference might be but haven't dug into it).
I think the question is the memory implications of using BufferedFSInputStream and how large we should choose the buffer size to be. In my tests I chose 256 MB (that's the HDFS block size). @vinothchandar ^ WDYT?

To help understand my changes, here is a temporary PR with them: https://github.com/uber/hudi/pull/373/files

BufferedInputStream bufferedInputStream = new BufferedInputStream(inputStream);
do {
// read upto logblocksize to find next magic header
corruptedBytes = new byte[logBlockSize];
bvaradar (Contributor) commented Apr 4, 2018:

What if the MAGIC string crosses the corruptedBytes boundary (meaning it appears partly at the end of the first read block and partly at the beginning of the next block)? Are you taking care of that?

Member:

Good point.. @n3nash can you confirm? It also highlights that the byte stream abstraction is still easier to reason with.
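
One way to take care of the boundary case, sketched under the assumption that the scan proceeds chunk by chunk (findMagicOffset is a hypothetical helper, not this PR's code): carry the last MAGIC.length - 1 bytes of each chunk over to the front of the next chunk before scanning, so a header split across two reads is still found.

import java.io.IOException;
import java.io.InputStream;

class ChunkedMagicScanSketch {
  // Scan an input stream chunk by chunk for `magic`, keeping an overlap of
  // magic.length - 1 bytes between chunks so a split header is not missed.
  static long findMagicOffset(InputStream in, byte[] magic, int chunkSize) throws IOException {
    byte[] buf = new byte[chunkSize + magic.length - 1];
    int carry = 0;      // bytes carried over from the previous chunk
    long consumed = 0;  // absolute stream offset corresponding to buf[0]
    while (true) {
      int read = in.read(buf, carry, chunkSize);
      if (read <= 0) {
        return -1;      // EOF without finding the header
      }
      int valid = carry + read;
      for (int i = 0; i + magic.length <= valid; i++) {
        int j = 0;
        while (j < magic.length && buf[i + j] == magic[j]) {
          j++;
        }
        if (j == magic.length) {
          return consumed + i; // absolute offset of the header
        }
      }
      // Keep the tail that could be the start of a split header, drop the rest.
      carry = Math.min(magic.length - 1, valid);
      System.arraycopy(buf, valid - carry, buf, 0, carry);
      consumed += valid - carry;
    }
  }
}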

bvaradar (Contributor) commented Apr 4, 2018

@n3nash: Regarding the performance difference, I see that you have reduced the number of readFully() calls from 2 to 1 for each round when dealing with OLD_MAGIC vs NEW_MAGIC. Could this explain the perf difference you are seeing ?

n3nash (Contributor, Author) commented Apr 4, 2018

@vinothchandar Not sure if you missed my comment earlier. Please look at my comment above about whether or not to use BufferedFSInputStream; based on that, I will either make changes in this PR or go with a PR similar to this, #373.

vinothchandar (Member) commented:

Yeah, I think I did miss that. Apologies.

I am in favor of using a cleaner byte stream abstraction. Can we have the buffer size be configurable? I am a bit concerned about having this large a buffer for the RecordReader; also, theoretically, a few MB should be enough to amortize the seek costs (the main issue in this PR).

@n3nash n3nash changed the title Optimizing search for start and end of corrupt blocks when a corrupt block needs to be created (WIP) Optimizing search for start and end of corrupt blocks when a corrupt block needs to be created Apr 5, 2018
n3nash (Contributor, Author) commented Apr 5, 2018

Moved the PR to #373. Let's discuss there.

vinothchandar (Member) commented:

Closing this since #373 is in a more final shape.
