
Parallelizing parquet write and spark's external read operation. #294

Merged: 1 commit, Mar 15, 2018

Conversation

ovj (Contributor) commented Jan 3, 2018

For large data inserts we have noticed that the time spent reading records from Spark's external sorter is comparable to the time spent writing records to parquet. We want to reduce the overall write time by parallelizing the read and write operations.

As part of this PR we are parallelizing the operations below:

  • reading records from Spark's external record reader and pre-computing the insert value (this saves the writer thread's time; the benefit depends heavily on the complexity of the schema).
  • writing records to the parquet file.

With these changes we are able to reduce our final parquet write stage runtime from ~1.1 h to ~19 min.
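The shape of the change can be sketched as a bounded producer-consumer pipeline (a simplified, hypothetical sketch: `PipelinedWrite` and the `Integer` records are illustrative stand-ins, not the actual Hudi `BufferedIterator`):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified sketch of the producer-consumer split: one thread drains the
// source iterator into a bounded queue while the caller's thread consumes,
// so reading and writing overlap instead of alternating.
public class PipelinedWrite {

    public static List<Integer> copyPipelined(Iterator<Integer> source) throws Exception {
        BlockingQueue<Optional<Integer>> buffer = new LinkedBlockingQueue<>(128);
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Producer: reads records (and could pre-compute the insert value)
        // off the writer's critical path.
        Future<?> producer = pool.submit(() -> {
            try {
                while (source.hasNext()) {
                    buffer.put(Optional.of(source.next()));
                }
                buffer.put(Optional.empty()); // end-of-stream marker
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        // Consumer: stands in for the parquet writer; here we just collect.
        List<Integer> written = new ArrayList<>();
        Optional<Integer> next;
        while ((next = buffer.take()).isPresent()) {
            written.add(next.get());
        }
        producer.get(); // surface any producer failure
        pool.shutdown();
        return written;
    }
}
```

The bounded queue is what keeps memory in check; the real PR additionally resizes that bound from sampled record sizes.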

@vinothchandar, @n3nash, @jianxu please take a look at it.

FYI @esmioley

ovj (Contributor Author) commented Jan 11, 2018

Spoke offline with Vinoth; summarizing our discussion here. We want to see if we can do the same for MergeHandle. Here are my initial thoughts.
MergeHandle performs the two operations below:

  • Reading records from Spark and adding them to "keyToNewRecords". There is no benefit in parallelizing this operation.
  • Reading records from the old parquet file and writing them to the new parquet file. There is some room to optimize here, depending on how much total time is spent reading from the old parquet file versus writing to the new one. I will measure the timings and report back. If needed, I will create another PR to handle the MergeHandle use case.

@@ -46,6 +46,8 @@
private static final String INSERT_PARALLELISM = "hoodie.insert.shuffle.parallelism";
private static final String BULKINSERT_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism";
private static final String UPSERT_PARALLELISM = "hoodie.upsert.shuffle.parallelism";
private static final String INSERT_WRITE_BUFFER_LIMIT = "hoodie.insert.write.buffer.limit";
Member:

I would just call this hoodie.write.buffer.limit since we intend to eventually add this to MergeHandle as well. Please rename the variables everywhere accordingly.

Member:

actually hoodie.write.buffer.limit.bytes

Contributor Author:

done. Changed it to "hoodie.write.buffer.limit.bytes".

public class BufferedIterator<K extends HoodieRecordPayload, T extends HoodieRecord<K>> implements Iterator<T> {

private static Logger logger = LogManager.getLogger(BufferedIterator.class);
private static final int RECORD_SAMPLING_RATE = 64;
Member:

1-line docs on what these constants are.

Member:

and comments in general for methods where necessary

Contributor Author:

let me add comments.

@@ -121,6 +121,10 @@
<artifactId>jackson-core-asl</artifactId>
<version>1.9.13</version>
</dependency>
<dependency>
Member:

hoodie-common cannot depend on spark.. this is also pulled in by hoodie-hadoop-mr.. Please remove this dependency.

Contributor Author:

Removed it. Earlier we had created HoodieSparkTaskContext to pass TaskContext information to the newly launched thread (since TaskContext.get() is thread-local). We have since found another way to set the TaskContext that does not need any of these changes: "TaskContext$.MODULE$.setTaskContext(taskContext);"
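The underlying problem is generic: a value published through a ThreadLocal (like Spark's TaskContext.get()) is invisible on a freshly launched thread until someone re-sets it there. A minimal, hypothetical illustration (ContextHandoff and the String context are stand-ins; Spark's real fix is the TaskContext$.MODULE$.setTaskContext call quoted above):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Generic sketch of the thread-local hand-off: the worker thread must
// explicitly re-set the context before any code on that thread reads it.
public class ContextHandoff {

    private static final ThreadLocal<String> CONTEXT = new ThreadLocal<>();

    public static String runOnWorker(String context) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            return pool.submit(() -> {
                CONTEXT.set(context); // analogous to TaskContext$.MODULE$.setTaskContext(tc)
                return CONTEXT.get(); // downstream code on this thread now sees the context
            }).get();
        } finally {
            pool.shutdown();
        }
    }
}
```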

import org.apache.spark.TaskContext;
import scala.Serializable;

public class HoodieSparkTaskContext implements Serializable {
Member:

don't understand the motivation for this class.. https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/TaskContext.html seems serializable, and all this is doing is picking four variables out of the TaskContext class. Can we remove this?

Member:

Also like I mentioned this class should not reside inside hoodie-common anyways

Contributor Author:

Removed this class; setting it via "TaskContext$.MODULE$.setTaskContext(taskContext);"

* every {@link #RECORD_SAMPLING_RATE}th record and adjusts number of records in buffer accordingly. This is done to
* ensure that we don't OOM.
*/
public class BufferedIterator<K extends HoodieRecordPayload, T extends HoodieRecord<K>> implements Iterator<T> {
Member:

Can you add a unit test for BufferedIterator?

Contributor Author:

let me add one.

private final AtomicBoolean isDone = new AtomicBoolean(false);
private T nextRecord;

private final AtomicLong readCounter = new AtomicLong(0);
Member:

why the intermittent newlines between variables? Don't see any obvious grouping of variables either. Can we fix this?

Contributor Author:

sure let me fix it.

@Override
protected List<WriteStatus> computeNext() {
List<WriteStatus> statuses = new ArrayList<>();
final HoodieSparkTaskContext hoodieSparkTaskContext = HoodieSparkTaskContext.createNewHoodieSparkTaskContext();
Member:

At least 1 comment every 10 lines is a good rule of thumb IMO.


private final AtomicLong samplingRecordCounter = new AtomicLong(-1);

public BufferedIterator(final Iterator<T> iterator, HoodieWriteConfig config) {
Member:

pass in just the value for the buffer limit instead of the whole HoodieWriteConfig object? A class named BufferedIterator having HoodieWriteConfig passed in seems like something to avoid IMO.

Contributor Author:

Along with bufferLimit I also need the schema; will pass both in, then.


this.schema = HoodieIOHandle.createHoodieWriteSchema(config);
}

private void adjustBufferSize(final T record) throws InterruptedException {
Member:

rename -> adjustBufferSizeIfNeeded

}

private void readNextRecord() {
if (readCounter.incrementAndGet() % 1024 == 0) {
Member:

pull 1024 into a constant above?

}
}

public void startRecordReader() {
Member:

please rename to something like startBuffering(), current name makes it sound like you are starting another thread

Contributor Author:

makes sense. done.


@Override
public T next() {
if (this.nextRecord == null && !this.isDone.get()) {
Member:

can you just call hasNext() here, instead of repeating code?

Contributor Author:

Sounds good.

private final long bufferMemoryLimit;
private final Schema schema;

private long avgRecordSizeOfSampledRecords = 0;
Member:

rename: avgSampleSizeBytes

Contributor Author:

done.

private final Schema schema;

private long avgRecordSizeOfSampledRecords = 0;
private long numOfSampledRecords = 0;
Member:

rename: numSamples

Contributor Author:

done.


while (true) {
try {
throwExceptionIfFailed();
newRecord = buffer.poll(5, TimeUnit.SECONDS);
Member:

pull 5 into a constant above.

/**
* We will be using record size to determine how many records we should cache and will change permits accordingly.
*/
private final Semaphore rateLimiter = new Semaphore(1);
Member:

use of a Semaphore just to track the buffer size seems like overkill to me.. Can we just use an AtomicLong named currentBufferSize?

Contributor Author (ovj, Mar 12, 2018):

It is used for throttling too. AtomicLong here will not be sufficient.


private void insertRecord(T t) throws Exception {
adjustBufferSize(t);
rateLimiter.acquire();
Member:

are the permits record-based or byte-based? In adjustBufferSize, we seem to be acquiring/releasing as many permits as the buffer needs to shrink/grow in bytes.. but here, in insertRecord as well as in readNextRecord, each permit seems to imply a record?

Side note: if we are just sampling, what value do we use to reduce the buffer size by once we pick an item off?

Contributor Author:

Permits are based on record count (not on bytes). In "adjustBufferSizeIfNeeded" we try to acquire/release multiple permits based on whether the new limit has increased or decreased.
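The permit accounting described above can be sketched as follows (an illustrative sketch under stated assumptions, not the actual Hudi code; class and method names here are hypothetical): the semaphore holds one permit per buffered record, and each time a sampled average record size arrives, the permitted record count is recomputed from the byte budget and the semaphore is grown or shrunk by the difference.

```java
import java.util.concurrent.Semaphore;

// One permit == one buffered record. The number of permitted records is
// derived from a byte budget divided by the sampled average record size.
public class AdaptiveBuffer {

    private final long memoryLimitBytes;
    private final Semaphore rateLimiter;
    private int currentPermitTarget;

    public AdaptiveBuffer(long memoryLimitBytes, int initialRecords) {
        this.memoryLimitBytes = memoryLimitBytes;
        this.currentPermitTarget = initialRecords;
        this.rateLimiter = new Semaphore(initialRecords);
    }

    // Called every RECORD_SAMPLING_RATE-th record with a fresh size estimate.
    public void adjustBufferSizeIfNeeded(long avgRecordSizeBytes) {
        int newTarget = (int) Math.max(1, memoryLimitBytes / avgRecordSizeBytes);
        if (newTarget > currentPermitTarget) {
            // Grow: allow more records into the buffer.
            rateLimiter.release(newTarget - currentPermitTarget);
        } else if (newTarget < currentPermitTarget) {
            // Shrink: reclaim permits; blocks until readers drain enough records.
            rateLimiter.acquireUninterruptibly(currentPermitTarget - newTarget);
        }
        currentPermitTarget = newTarget;
    }

    public int availableRecordSlots() {
        return rateLimiter.availablePermits();
    }
}
```

Producers would acquire one permit per inserted record and consumers release one per record read, which is why an AtomicLong alone cannot provide the blocking/throttling behavior.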

private long numOfSampledRecords = 0;
private final Iterator<T> internalIterator;

private final AtomicReference<Exception> hasFailed = new AtomicReference(null);
Member:

you can simply use volatile if your intention is just to publish the exception from the other thread here quickly.

Contributor Author:

volatile may not work here. My intention is to capture the root cause of the failure (the other exception will most likely be an InterruptedException).
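The first-failure-wins behavior being described can be shown in a few lines (FailureTracker is a hypothetical stand-in for the fields in the PR): compareAndSet only stores the first exception, so a later, secondary InterruptedException cannot overwrite the root cause, whereas with a plain volatile field the last write would win.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of first-failure capture: only the first markAsFailed call stores
// its exception; subsequent calls are no-ops.
public class FailureTracker {

    private final AtomicReference<Exception> hasFailed = new AtomicReference<>(null);

    public void markAsFailed(Exception e) {
        hasFailed.compareAndSet(null, e); // only the first caller wins
    }

    public Exception rootCause() {
        return hasFailed.get();
    }
}
```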

} catch (Exception e) {
logger.error("error writing hoodie records", e);
if (writeError.compareAndSet(null, e)) {
bufferedIterator.markAsFailed(e);
Member:

tbh not a big fan of hand-rolled threads and thread-to-thread error notification.. Can we just use an ExecutorService, and an https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html if needed, to fire off two futures and cancel one if the other fails?

The need for the semaphore here should go away too..

this.rateLimiter.release(RECORD_CACHING_LIMIT + 1);
}

private static final class BufferedIteratorRecord<T> {
Member:

why do we need this class? Can't we just use Optional above?

Contributor Author:

makes sense. done.

}
}

private static final class BufferedInsertPayload implements HoodieRecordPayload {
Member:

same here.. can we avoid this class by narrowing/adjusting the generic type definition in the class?

Contributor Author:

Sure. I wanted this to offload computation to the reader thread. Let me introduce another API in HoodieRecord: prepareInsertValue(). Let me know what you think.
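The offloading idea could look roughly like this (a hypothetical sketch: PrecomputedRecord and the String payload stand in for HoodieRecord and its Avro conversion, and prepareInsertValue is the proposed, not-yet-existing API):

```java
// The record exposes a prepareInsertValue() step so the reader thread can
// perform the (potentially schema-dependent, expensive) conversion before
// the writer thread ever touches the record.
public class PrecomputedRecord {

    private final String raw;
    private String prepared; // cached result of the expensive conversion

    public PrecomputedRecord(String raw) {
        this.raw = raw;
    }

    // Called on the reader thread while buffering.
    public void prepareInsertValue() {
        this.prepared = raw.toUpperCase(); // stand-in for the real conversion
    }

    // Called on the writer thread; returns the precomputed value if present.
    public String getInsertValue() {
        return prepared != null ? prepared : raw.toUpperCase();
    }
}
```

The fallback in getInsertValue keeps the record usable even if buffering never ran the preparation step.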

ovj (Contributor Author) commented Mar 14, 2018

Thanks @vinothchandar. Addressed all your comments. Please take another look at it. I have verified it internally.

@vinothchandar vinothchandar merged commit c5b4cb1 into apache:master Mar 15, 2018
vinishjail97 pushed a commit to vinishjail97/hudi that referenced this pull request Dec 15, 2023
…te and read for avro (apache#8764) (apache#294)

Co-authored-by: Sivabalan Narayanan <n.siva.b@gmail.com>