[SPARK-21113][CORE] Read ahead input stream to amortize disk IO cost …
Profiling some of our big jobs, we see that around 30% of the time is spent reading spill files from disk. To amortize the disk I/O cost, the idea is to implement a read-ahead input stream that asynchronously reads ahead from the underlying input stream once a specified amount of data has been read from the current buffer. It does so by maintaining two buffers: an active buffer and a read-ahead buffer. The active buffer contains the data that should be returned when a read() call is issued. The read-ahead buffer is used to asynchronously read from the underlying input stream; once the active buffer is exhausted, we flip the two buffers so that we can start reading from the read-ahead buffer without being blocked on disk I/O.
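
The two-buffer flip described above can be illustrated with a minimal sketch. This is not the class added by this patch: the name `DoubleBufferedInputStream`, its fields, and its threading setup are hypothetical. As a simplification, it schedules the next read-ahead immediately after each flip rather than after a configurable amount of the active buffer has been consumed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch of a double-buffered read-ahead stream. */
class DoubleBufferedInputStream extends InputStream {
  private final InputStream in;
  // Single daemon thread that fills the read-ahead buffer in the background.
  private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
    Thread t = new Thread(r, "read-ahead-sketch");
    t.setDaemon(true);
    return t;
  });
  private byte[] active;            // data handed out by read()
  private byte[] readAhead;         // being filled asynchronously
  private int pos = 0;              // next byte to return from active
  private int limit = 0;            // number of valid bytes in active
  private Future<Integer> pending;  // in-flight fill of readAhead

  DoubleBufferedInputStream(InputStream in, int bufferSize) {
    if (bufferSize <= 0) {
      throw new IllegalArgumentException("bufferSize must be positive");
    }
    this.in = in;
    this.active = new byte[bufferSize];
    this.readAhead = new byte[bufferSize];
    scheduleReadAhead();
  }

  /** Starts an asynchronous read from the underlying stream into readAhead. */
  private void scheduleReadAhead() {
    final byte[] target = readAhead;
    Callable<Integer> task = () -> in.read(target, 0, target.length);
    pending = executor.submit(task);
  }

  /** Waits for the pending fill, then flips the buffers. Returns false at EOF. */
  private boolean flip() throws IOException {
    int n;
    try {
      n = pending.get(); // blocks only if the read-ahead has not finished yet
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException(e);
    } catch (ExecutionException e) {
      throw new IOException(e.getCause());
    }
    if (n < 0) {
      return false; // underlying stream is exhausted
    }
    byte[] tmp = active;
    active = readAhead;
    readAhead = tmp;
    pos = 0;
    limit = n;
    scheduleReadAhead(); // refill the old active buffer in the background
    return true;
  }

  @Override
  public int read() throws IOException {
    if (pos == limit && !flip()) {
      return -1;
    }
    return active[pos++] & 0xFF;
  }

  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    if (len == 0) {
      return 0;
    }
    if (pos == limit && !flip()) {
      return -1;
    }
    int n = Math.min(len, limit - pos);
    System.arraycopy(active, pos, b, off, n);
    pos += n;
    return n;
  }

  @Override
  public void close() throws IOException {
    executor.shutdownNow();
    in.close();
  }
}
```

Wrapping any `InputStream` (for example a `FileInputStream` over a spill file) in this sketch lets the caller consume the active buffer while the next chunk is fetched concurrently. A production version would also need to handle `skip()`, `available()`, and closing while a read is in flight, all of which are omitted here.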

## How was this patch tested?

Tested by running a job on the cluster, where we saw up to 8% CPU improvement.

Author: Sital Kedia <skedia@fb.com>
Author: Shixiong Zhu <zsxwing@gmail.com>
Author: Sital Kedia <sitalkedia@users.noreply.github.com>

Closes #18317 from sitalkedia/read_ahead_buffer.
Sital Kedia authored and zsxwing committed Sep 18, 2017
1 parent 7c72662 commit 1e978b1
Showing 5 changed files with 495 additions and 13 deletions.
