DRILL-6410: Memory leak in Parquet Reader during cancellation #1333

vrozov · 2018-06-25T06:30:35Z

@parthchandra @sachouche @ilooner Please review

sachouche

Looks good; thanks @vrozov for making the changes.

sachouche · 2018-06-26T17:24:22Z

...va-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java

-  private long totalPageValuesRead = 0;
-  private Object pageQueueSyncronize = new Object(); // Object to use to synchronize access to the page Queue.
-                                                     // FindBugs complains if we synchronize on a Concurrent Queue
+  private final ExecutableTasksLatch<AsyncPageReaderTask> executableTasksLatch;


The ExecutableTasksLatch replaces the ExecutorService for concurrent task manipulation:

Is this the recommended way moving forward?

There is no javadoc for ExecutableTasksLatch; you need to document this new class's API and execution semantics. This is especially true if we want to encourage the use of this implementation pattern.

ExecutableTasksLatch encapsulates ExecutorService for parallel tasks that require a synchronization point at cancellation. For all other tasks ExecutorService can be used directly.

I'll add more documentation after the first round of review.

The documentation would be useful for the reviewer too. It is easier to check the code if one knows what guarantees ExecutableTasksLatch is supposed to provide.

Note that the class comment for AsyncPageReader needs to be updated too.

IMO, most of the time, the source code should document itself and java documentation is necessary for an API distributed as an already compiled library/jar.

I would prefer reviewers to point to obscure code rather than to rely on the documentation.

Waiting for review comments helps to avoid inconsistency between the code and the documentation (quite a common problem), as code usually evolves faster and documentation lags behind.

I already added documentation for ExecutableTasksLatch and ExecutableTask.

I'll change comments for AsyncPageReader during rebase and merge conflict resolution.

I'm afraid I'm going to have to insist.

Please be more specific. Do you refer to AsyncPageReader doc? It was already changed during the merge conflict resolution.

priteshm · 2018-06-30T00:16:50Z

@ilooner can you also take a look at it?

ilooner · 2018-06-30T01:56:03Z

@priteshm will take a look Monday.
@vrozov please fix the conflict and the Travis failures.

vrozov · 2018-06-30T02:13:00Z

@ilooner I'll rebase after review. The Travis CI failure is not related to the change; it failed because the build exceeded the Travis CI limit.

ilooner

@vrozov You put me in a tough spot :). After our last discussion I was under the impression that the majority of us were in agreement to use the approach outlined in #1257 with the exception of having a custom ExecutorService which would be removed.

The advantages of that approach were that we could delegate the complexity to the java concurrency library and have minimal maintenance overhead. With your approach we are reinventing the wheel to create our own version of concurrency classes that are already there. Considering how the first changes to PartitionerTask took 4 - 5 days of back and forth to resolve all the race conditions, we might be creating a lot of extra work for ourselves down the road with your approach.

So again you put me in a tough spot because I don't agree with the solution, and based on the discussion previously the majority wanted to move in a different direction. On one hand I have to provide honest feedback in my reviews, but on the other hand I definitely don't want to be the guy blocking a bug fix.

To move this PR forward I suggest finding a committer that agrees with the approach you decided to take, and who is willing to help maintain the code moving forward. They can take the review to completion and help you get this merged :).

vrozov · 2018-07-02T16:34:05Z

@ilooner I guess by "last discussion" you refer to the discussion between you, me and @sachouche, where "majority" does not mean the community majority. In Apache, any contributor can provide a solution that (s)he considers to be the best solution possible, and then it can either be accepted by the community/contributor or blocked with -1 (requires technical justification). If another contributor provides an alternative solution, the community may decide to go with the alternate solution as long as it addresses the technical concerns of the initial contribution. For this particular case, my requirements are a) a unified approach (@parthchandra has the same requirement) and b) the ability to cancel tasks asynchronously. If that can be done with the approach outlined in PR #1257 and a contributor will change it to address all the issues, let's move forward with the alternate approach.

A note regarding the complexity of the implementation: it uses public Java concurrency classes as well. It does not rely on unsupported or unsafe Java classes or APIs. LockSupport is as much a first-class concurrency construct as the Thread or CountDownLatch classes. The primary use case for those constructs here is to create a combination of ExecutorService and CountDownLatch that Java itself does not provide.

To summarize, I am perfectly fine going with an alternate solution or with another committer reviewing the PR; it would be against the Apache way to force a committer to review or commit a change that (s)he is not comfortable with.
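
As an illustration of the "ExecutorService plus CountDownLatch" combination mentioned above, here is a minimal sketch built on LockSupport. It is not the code from this PR: the class name CountingExecutor, its methods, and the single-waiter assumption are all hypothetical, chosen only to show how park/unpark can provide a wait-for-all-tasks point on top of a plain executor.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.LockSupport;

/** Hypothetical sketch (not Drill code): an executor wrapper one thread can wait on. */
class CountingExecutor {
  private final ExecutorService executor;
  private final AtomicInteger inFlight = new AtomicInteger();
  private volatile Thread waiter; // the single thread parked in await(), if any

  CountingExecutor(ExecutorService executor) {
    this.executor = executor;
  }

  void execute(Runnable task) {
    inFlight.incrementAndGet(); // count before hand-off so await() cannot miss the task
    try {
      executor.execute(() -> {
        try {
          task.run();
        } finally {
          taskEnded();
        }
      });
    } catch (RuntimeException e) { // e.g. RejectedExecutionException
      taskEnded();
      throw e;
    }
  }

  private void taskEnded() {
    if (inFlight.decrementAndGet() == 0) {
      LockSupport.unpark(waiter); // unpark(null) has no effect
    }
  }

  /** Blocks until every task passed to execute() has ended. Assumes one waiting thread. */
  void await() {
    waiter = Thread.currentThread();
    boolean interrupted = false;
    while (inFlight.get() > 0) {
      LockSupport.park(this); // may return spuriously, hence the loop
      if (Thread.interrupted()) {
        interrupted = true; // remember the interrupt, keep waiting
      }
    }
    waiter = null;
    if (interrupted) {
      Thread.currentThread().interrupt();
    }
  }

  /** Runs 16 small tasks and returns how many had completed when await() returned. */
  static int demo() {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    CountingExecutor counting = new CountingExecutor(pool);
    AtomicInteger completed = new AtomicInteger();
    for (int i = 0; i < 16; i++) {
      counting.execute(completed::incrementAndGet);
    }
    counting.await();
    pool.shutdown();
    return completed.get();
  }

  public static void main(String[] args) {
    System.out.println("completed tasks: " + demo());
  }
}
```

Unlike a CountDownLatch, the counter here does not need to know the number of tasks up front, which is the property a page reader that keeps submitting tasks would need. The sketch deliberately leaves out cancellation.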

vrozov · 2018-07-11T23:09:29Z

@parthchandra Please review

parthchandra · 2018-07-17T23:00:24Z

...va-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java

-  private long totalPageValuesRead = 0;
-  private Object pageQueueSyncronize = new Object(); // Object to use to synchronize access to the page Queue.
-                                                     // FindBugs complains if we synchronize on a Concurrent Queue
+  private final ExecutableTasksLatch<AsyncPageReaderTask> executableTasksLatch;


The documentation would be useful for the reviewer too. It is easier to check the code if one knows what guarantees ExecutableTasksLatch is supposed to provide.

parthchandra · 2018-07-17T23:18:18Z

...va-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java

-  private long totalPageValuesRead = 0;
-  private Object pageQueueSyncronize = new Object(); // Object to use to synchronize access to the page Queue.
-                                                     // FindBugs complains if we synchronize on a Concurrent Queue
+  private final ExecutableTasksLatch<AsyncPageReaderTask> executableTasksLatch;


Note that the class comment for AsyncPageReader needs to be updated too.

parthchandra · 2018-07-18T18:23:40Z

exec/java-exec/src/main/java/org/apache/drill/exec/util/ExecutableTasksLatch.java

+   */
+  public ExecutableTask<C> execute(C callable) {
+    ExecutableTasksLatch.ExecutableTask<C> task = new ExecutableTasksLatch.ExecutableTask<>(callable, this);
+    executor.execute(task);


Call Executor.submit and save the Future? That saves you all the checking you are doing in take(), which is a simplified version of Future.get(). Note that the Future is likely to be faster and more tested than this.

I doubt that Future would be faster, but it is definitely more tested. Please see my comment why Future and FutureTask can't be used.

parthchandra · 2018-07-18T22:57:26Z

...va-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java

-        // Do nothing.
-      }
-    }
+    executableTasksLatch.await(() -> true);


Isn't the original code doing the same thing that await is doing? TBH, I'd really like to understand where the memory leak was occurring. (Just trying to understand how this PR does, in fact, fix the issue.)

No, ExecutableTasksLatch.await() guarantees that when it returns all tasks submitted for execution are either done or canceled. Future.cancel() does not wait for the FutureTask to be canceled as it merely interrupts the thread where FutureTask is running (in the case it is already running). Note that after Future is canceled it is not possible to check whether it is finished or not (Future.get throws CancellationException).

Without a guarantee that all tasks are finished when clear() returns, some tasks continue to run and reference vectors, so the allocator reports a memory leak.

Thanks for clarifying. Makes sense.
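
The Future.cancel() behavior described in this thread can be reproduced with the standard library alone. The sketch below is illustrative and not taken from the PR; the class name CancelDoesNotWaitDemo and the latches standing in for a page read and its cleanup are assumptions. It checks three things: the task body is still running after cancel(true) returns, isDone() already reports true, and get() throws CancellationException, so the Future cannot be used to wait for the body to end.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical demo (not Drill code) of Future.cancel(true) returning before the task ends. */
class CancelDoesNotWaitDemo {

  /** Returns true when the body was still running after cancel(true) had returned. */
  static boolean bodyOutlivesCancel() {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    CountDownLatch started = new CountDownLatch(1);
    CountDownLatch allowCleanupToEnd = new CountDownLatch(1);
    CountDownLatch bodyEnded = new CountDownLatch(1);
    try {
      Future<?> future = executor.submit(() -> {
        started.countDown();
        try {
          Thread.sleep(60_000);        // stands in for a blocking page read
        } catch (InterruptedException e) {
          allowCleanupToEnd.await();   // stands in for cleanup that takes a while
        } finally {
          bodyEnded.countDown();
        }
        return null;                   // makes the lambda a Callable
      });
      started.await();
      future.cancel(true);             // interrupts the worker and returns at once

      boolean stillRunning = bodyEnded.getCount() == 1;
      boolean reportedDone = future.isDone();
      boolean getThrew = false;
      try {
        future.get();
      } catch (CancellationException e) {
        getThrew = true;               // get() gives no way to wait for the body
      } catch (ExecutionException e) {
        throw new IllegalStateException(e);
      }

      allowCleanupToEnd.countDown();   // let the body finish before shutting down
      bodyEnded.await();
      return stillRunning && reportedDone && getThrew;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      executor.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("body outlived cancel(): " + bodyOutlivesCancel());
  }
}
```

In the reader, the window between cancel() returning and the body actually ending is where a still-running task can touch buffers that the caller believes are safe to release.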

Seems to me that we could have extended FutureTask to provide this guarantee. Perhaps that is what @ilooner would have liked too.
See this : https://stackoverflow.com/questions/6040962/wait-for-cancel-on-futuretask

I know and saw that post before the fix for the same problem was implemented only for PartitionerDecorator. I decided to go with an approach similar to the one outlined in the post, but one that does not use FutureTask, in order to support the additional functionality that I outlined in my response to @ilooner. This implementation supports asynchronous cancellation (a call to cancel() does not block waiting for a task to complete, allowing a faster cancellation), which the solution that extends FutureTask does not provide.
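
The separation between a non-blocking cancel() and a later wait for completion can be sketched as follows. This is not the PR's ExecutableTask: the class AwaitableTask, its fields, and the use of a CountDownLatch and synchronized blocks (instead of whatever the PR uses internally) are assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch (not Drill code): cancel() never blocks, awaitFinished() does. */
class AwaitableTask implements Runnable {
  private final Runnable body;
  private final CountDownLatch finished = new CountDownLatch(1);
  private Thread runner;     // guarded by this; non-null only while the body runs
  private boolean started;   // guarded by this
  private boolean cancelled; // guarded by this

  AwaitableTask(Runnable body) {
    this.body = body;
  }

  @Override
  public void run() {
    synchronized (this) {
      if (cancelled) {
        return; // cancel() already released any waiter
      }
      started = true;
      runner = Thread.currentThread();
    }
    try {
      body.run();
    } finally {
      synchronized (this) {
        runner = null;
        if (cancelled) {
          Thread.interrupted(); // drop a late interrupt so it cannot leak into the next task
        }
      }
      finished.countDown();
    }
  }

  /** Requests cancellation and returns without waiting for the body to stop. */
  synchronized void cancel() {
    cancelled = true;
    if (!started) {
      finished.countDown(); // never ran, so there is nothing to wait for
    } else if (runner != null) {
      runner.interrupt();
    }
  }

  /** Blocks until the body has stopped, or returns at once if it was cancelled before starting. */
  void awaitFinished() throws InterruptedException {
    finished.await();
  }

  /** Returns true when the body's cleanup had run by the time awaitFinished() returned. */
  static boolean demo() {
    ExecutorService pool = Executors.newSingleThreadExecutor();
    CountDownLatch bodyStarted = new CountDownLatch(1);
    AtomicBoolean released = new AtomicBoolean();
    AwaitableTask task = new AwaitableTask(() -> {
      bodyStarted.countDown();
      try {
        Thread.sleep(60_000); // stands in for a blocking read
      } catch (InterruptedException e) {
        // cancelled: fall through to cleanup
      }
      released.set(true);     // stands in for releasing buffers
    });
    try {
      pool.execute(task);
      bodyStarted.await();
      task.cancel();          // returns immediately
      task.awaitFinished();   // the synchronization point
      return released.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("cleanup finished before await returned: " + demo());
  }
}
```

With several such tasks, a caller can invoke cancel() on all of them first and only then wait, so the tasks wind down in parallel rather than one after another.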

parthchandra · 2018-07-18T22:57:33Z

exec/java-exec/src/main/java/org/apache/drill/exec/util/ExecutableTasksLatch.java

+   * Helper class to wrap {@linkplain Callable}{@literal <Void>} with cancellation and waiting for completion support
+   *
+   */
+  public static final class ExecutableTask<C extends Callable<Void>> implements Runnable {


Can you elaborate what this class has that is not already available in FutureTask?

Please see my comment for ExecutableTasksLatch.await().

vrozov · 2018-07-19T18:08:17Z

@parthchandra rebased and resolved conflicts. Please take a look.

parthchandra

+1. I'd like to see a perf test on this before we merge into master.

priteshm · 2018-08-31T17:32:30Z

@vrozov Dechang is running a perf test to confirm the question from @parthchandra. Once that is done, we should merge to the branch. Was there any other reason to close it?

sachouche reviewed Jun 26, 2018

vrozov closed this Jun 27, 2018

vrozov reopened this Jun 27, 2018

ilooner reviewed Jun 30, 2018

parthchandra reviewed Jul 18, 2018

DRILL-6410: Memory leak in Parquet Reader during cancellation

1a21016

parthchandra approved these changes Jul 20, 2018

vrozov closed this Aug 31, 2018

5 participants
